金融/语音/音频处理学术速递[10.20]

2021-10-22 15:58:40

q-fin金融,共计0篇

cs.SD语音,共计19篇

eess.AS音频处理,共计19篇

1.q-fin金融:

2.cs.SD语音:

【1】 Continual self-training with bootstrapped remixing for speech enhancement 标题:用于语音增强的自举混音连续自我训练 链接:https://arxiv.org/abs/2110.10103

作者:Efthymios Tzinis,Yossi Adi,Vamsi K. Ithapu,Buye Xu,Anurag Kumar 机构:University of Illinois at Urbana-Champaign,Facebook AI Research,Facebook Reality Labs Research 备注:Submitted to ICASSP 2022 摘要:我们提出了一种简单新颖的语音增强自监督训练方法RemixIT。该方法基于连续自训练方案,克服了以往研究的局限性,包括对域内噪声分布的假设和对干净目标信号的访问。具体地说,分离教师模型在域外数据集上预先训练,并用于推断一批域内混合的估计目标信号。接下来,我们通过使用置换估计的清洁和噪声信号生成人工混合来引导混合过程。最后,在使用最新的学生模型定期更新教师权重的同时,使用排列的估计源作为目标训练学生模型。我们的实验表明,在多个语音增强任务下,RemixIT的性能优于以前几种最先进的自监督方法。此外,RemixIT为语音增强任务的半监督和非监督域自适应提供了一个无缝的替代方案,同时具有足够的通用性,可以应用于任何分离任务,并与任何分离模型相匹配。 摘要:We propose RemixIT, a simple and novel self-supervised training method for speech enhancement. The proposed method is based on a continuously self-training scheme that overcomes limitations from previous studies including assumptions for the in-domain noise distribution and having access to clean target signals. Specifically, a separation teacher model is pre-trained on an out-of-domain dataset and is used to infer estimated target signals for a batch of in-domain mixtures. Next, we bootstrap the mixing process by generating artificial mixtures using permuted estimated clean and noise signals. Finally, the student model is trained using the permuted estimated sources as targets while we periodically update teacher's weights using the latest student model. Our experiments show that RemixIT outperforms several previous state-of-the-art self-supervised methods under multiple speech enhancement tasks. Additionally, RemixIT provides a seamless alternative for semi-supervised and unsupervised domain adaptation for speech enhancement tasks, while being general enough to be applied to any separation task and paired with any separation model.
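为帮助理解上文RemixIT“教师估计→置换重混→学生训练→周期更新教师”的自训练循环,下面给出一个极简的Python/PyTorch骨架;其中teacher、student、optimizer以及损失形式均为假设的占位设定(教师更新以指数滑动平均为例),并非论文官方实现。

```python
# 假设性示意:RemixIT自举混音训练循环的最小骨架(非官方实现)
import torch

def remixit_step(teacher, student, optimizer, noisy_batch):
    """noisy_batch: (B, T) 域内带噪混合;teacher/student为任意"输入混合、输出(语音, 噪声)"的分离模型(假设接口)。"""
    with torch.no_grad():
        est_speech, est_noise = teacher(noisy_batch)           # 教师推断出的伪目标
    perm = torch.randperm(noisy_batch.size(0))
    remixed = est_speech + est_noise[perm]                     # 置换噪声估计,得到新的人工混合
    pred_speech, pred_noise = student(remixed)
    loss = torch.nn.functional.l1_loss(pred_speech, est_speech) \
         + torch.nn.functional.l1_loss(pred_noise, est_noise[perm])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def update_teacher(teacher, student, beta=0.99):
    """周期性地用学生权重更新教师;此处以指数滑动平均为例,系数beta为假设值。"""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(beta).add_(ps, alpha=1.0 - beta)
```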

【2】 Temporal separation of whale vocalizations from background oceanic noise using a power calculation 标题:基于功率计算的鲸鱼发声与背景海洋噪声的时间分离 链接:https://arxiv.org/abs/2110.10010

作者:Jacques van Wyk,Jaco Versfeld,Johan du Preez 机构:Department of Electrical and Electronic Engineering, University of Stellenbosch, Stellenbosch, South Africa, ! 摘要:在许多情况下,分析声音信号以搜索鲸目动物发声的过程是一项非常艰巨的任务,需要许多复杂的计算、过多的数字处理技术以及对声音信号进行仔细检查以确定发声的位置。为了简化这一过程,本文借助于平稳高斯噪声信号的稳健功率计算和确定给定样本帧的均值和方差的递归方法,开发了一种计算效率高且抗噪声的方法,用于确定音频段是否包含潜在的鲸目动物呼叫。由此产生的探测器在包含南露脊鲸声音的录音上进行测试,并将其性能与现有的当代能量探测器进行比较。该检测器在中高信噪比下表现出良好的性能。该探测器易于实现,计算效率高,使用可靠,能够在嘈杂的水下环境中准确检测鲸鱼的叫声。 摘要:The process of analyzing audio signals in search of cetacean vocalizations is in many cases a very arduous task, requiring many complex computations, a plethora of digital processing techniques and the scrutinization of an audio signal with a fine comb to determine where the vocalizations are located. To ease this process, a computationally efficient and noise-resistant method for determining whether an audio segment contains a potential cetacean call is developed here with the help of a robust power calculation for stationary Gaussian noise signals and a recursive method for determining the mean and variance of a given sample frame. The resulting detector is tested on audio recordings containing Southern Right whale sounds and its performance compared to that of an existing contemporary energy detector. The detector exhibits good performance at moderate-to-high signal-to-noise ratio values. The detector succeeds in being easy to implement, computationally efficient to use and robust enough to accurately detect whale vocalizations in a noisy underwater environment.
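下面是一个帮助理解“递归估计背景均值/方差+帧功率阈值”这类检测思路的最小示意(Python);frame_len、阈值系数k等参数均为假设值,具体功率统计量与论文方法未必一致。

```python
# 假设性示意:用递归均值/方差与帧功率阈值粗检疑似鲸类叫声的帧(非论文原始算法)
import numpy as np

def detect_frames(x, fs, frame_len=0.1, k=3.0):
    """x: 一维音频; 返回每帧是否疑似包含叫声。frame_len(秒)与阈值系数k为假设参数。"""
    n = int(frame_len * fs)
    powers = np.array([np.mean(x[i:i + n] ** 2)
                       for i in range(0, len(x) - n + 1, n)])
    mean, var, flags = 0.0, 0.0, []
    for t, p in enumerate(powers, start=1):
        # 递归(在线)更新背景功率的均值与方差(Welford式更新)
        delta = p - mean
        mean += delta / t
        var += delta * (p - mean)
        std = np.sqrt(var / t) if t > 1 else 0.0
        flags.append(p > mean + k * std)        # 帧功率显著高于背景 => 疑似发声
    return np.array(flags)
```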

【3】 Speech Pattern based Black-box Model Watermarking for Automatic Speech Recognition 标题:基于语音模式的自动语音识别黑盒模型水印 链接:https://arxiv.org/abs/2110.09814

作者:Haozhe Chen,Weiming Zhang,Kunlin Liu,Kejiang Chen,Han Fang,Nenghai Yu 机构:University of Science and Technology of China 备注:5 pages, 2 figures 摘要:模型水印技术作为一种有效的知识产权保护方法,已被广泛应用于各种深度神经网络(DNN),包括语音分类模型。然而,如何为自动语音识别(ASR)模型设计一个黑盒水印方案仍然是一个尚未解决的问题,这是保护部署在云服务器上的远程ASR应用程序编程接口(API)的重要需求。由于ASR模型的条件独立性假设和基于标签检测的规避攻击风险,用于语音分类模型的黑盒模型水印方案不能应用于ASR模型。在本文中,我们提出了第一个用于保护ASR模型IP的黑盒模型水印框架。具体来说,我们通过将模型拥有者的语音片段传播到整个输入音频上,并使用隐藏文本标记触发音频,从而通过语言隐藏隐藏作者信息来合成触发音频。在最先进的开源ASR系统DeepSpeech上的实验证明了所提出的水印方案的可行性,该方案对五种攻击都具有鲁棒性,并且对准确性影响很小。 摘要:As an effective method for intellectual property (IP) protection, model watermarking technology has been applied on a wide variety of deep neural networks (DNN), including speech classification models. However, how to design a black-box watermarking scheme for automatic speech recognition (ASR) models is still an unsolved problem, which is a significant demand for protecting remote ASR Application Programming Interface (API) deployed in cloud servers. Due to conditional independence assumption and label-detection-based evasion attack risk of ASR models, the black-box model watermarking scheme for speech classification models cannot apply to ASR models. In this paper, we propose the first black-box model watermarking framework for protecting the IP of ASR models. Specifically, we synthesize trigger audios by spreading the speech clips of model owners over the entire input audios and labeling the trigger audios with the stego texts, which hides the authorship information with linguistic steganography. Experiments on the state-of-the-art open-source ASR system DeepSpeech demonstrate the feasibility of the proposed watermarking scheme, which is robust against five kinds of attacks and has little impact on accuracy.
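为直观说明“把模型拥有者的语音片段铺满整段输入音频以构造触发音频”的大致思路,下面给出一个高度简化的示意;平铺叠加方式与强度系数alpha均为假设的近似,论文中具体的扩频方式与隐写文本标注请以原文为准。

```python
# 假设性示意:以低幅度把拥有者语音片段铺满宿主音频,生成水印训练用的触发音频(粗略近似)
import numpy as np

def make_trigger_audio(host, owner_clip, alpha=0.05):
    """host/owner_clip: 一维波形; alpha为假设的叠加强度。触发音频随后配以隐写文本作为标签。"""
    reps = int(np.ceil(len(host) / len(owner_clip)))
    spread = np.tile(owner_clip, reps)[:len(host)]         # 将拥有者片段铺满整段输入
    trigger = host + alpha * spread
    return trigger / max(1e-8, np.max(np.abs(trigger)))    # 归一化防止削波
```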

【4】 SSAST: Self-Supervised Audio Spectrogram Transformer 标题:SSAST:自监督音频频谱转换器 链接:https://arxiv.org/abs/2110.09784

作者:Yuan Gong,Cheng-I Jeff Lai,Yu-An Chung,James Glass 机构:MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 摘要:最近,纯粹基于自我注意的神经网络,如视觉变换器(ViT),在各种视觉任务上表现出优于卷积神经网络(CNN)构建的深度学习模型,从而将最初为语言处理开发的变换器的成功扩展到视觉领域。最近的一项研究表明,类似的方法也可以应用于音频领域。具体而言,音频频谱图转换器(AST)在各种音频分类基准上实现了最先进的结果。然而,与CNN相比,纯Transformer模型往往需要更多的训练数据,AST的成功依赖于有监督的预训练,这需要大量标记数据和复杂的训练管道,从而限制了AST的实际使用。本文主要研究音频和语音分类,旨在通过利用未标记数据的自监督学习来缓解AST的数据需求问题。具体地说,我们建议使用来自AudioSet和Librispeech的未标记音频,使用联合鉴别和生成掩蔽频谱图面片建模(MSPM)对AST模型进行预训练。我们在音频和语音分类任务(包括音频事件分类、关键词识别、情感识别和说话人识别)上评估我们的预训练模型。提出的自我监督框架显著提高了AST在所有任务上的性能,平均提高了60.9%,与有监督的预训练AST相比,结果相似甚至更好。据我们所知,它是音频和语音领域第一个基于补丁的自监督学习框架,也是第一个用于AST的自监督学习框架。 摘要:Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-art results on various audio classification benchmarks. However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining that requires a large amount of labeled data and a complex training pipeline, thus limiting the practical usage of AST. This paper focuses on audio and speech classification, and aims to alleviate the data requirement issues with the AST by leveraging self-supervised learning using unlabeled data. Specifically, we propose to pretrain the AST model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio from AudioSet and Librispeech. We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification. The proposed self-supervised framework significantly boosts AST performance on all tasks, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST. To the best of our knowledge, it is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST.
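下面用一个简化示意说明“掩蔽频谱图patch建模(MSPM)”中判别+生成联合预训练的大致形态;encoder、gen_head、disc_head均为占位模块,判别损失此处以InfoNCE式对比近似,与论文的具体损失形式可能不同。

```python
# 假设性示意:随机遮蔽频谱图patch,同时做生成式重建与判别式匹配的预训练损失
import torch

def mspm_loss(patches, encoder, gen_head, disc_head, mask_ratio=0.15):
    """patches: (B, N, D) 频谱图patch序列;三个模块均为假设的占位网络。"""
    B, N, D = patches.shape
    num_mask = max(1, int(N * mask_ratio))
    idx = torch.stack([torch.randperm(N)[:num_mask] for _ in range(B)])   # (B, num_mask)
    batch_idx = torch.arange(B).unsqueeze(1)
    masked = patches.clone()
    masked[batch_idx, idx] = 0.0                       # 遮蔽被选中的patch
    hidden = encoder(masked)                           # (B, N, H)
    h_sel, p_sel = hidden[batch_idx, idx], patches[batch_idx, idx]
    gen = torch.nn.functional.mse_loss(gen_head(h_sel), p_sel)            # 生成:重建被遮patch
    logits = torch.matmul(disc_head(h_sel).flatten(0, 1),
                          p_sel.flatten(0, 1).t())                        # 判别:辨认对应的真patch
    disc = torch.nn.functional.cross_entropy(logits, torch.arange(logits.size(0)))
    return gen + disc
```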

【5】 Improving Emotional Speech Synthesis by Using SUS-Constrained VAE and Text Encoder Aggregation 标题:基于SUS约束的VAE和文本编码器聚合改进情感语音合成 链接:https://arxiv.org/abs/2110.09780

作者:Fengyu Yang,Jian Luan,Yujun Wang 机构:Xiaomi Corporation, Beijing, China 备注:submitted to ICASSP2022 摘要:从参考音频中学习情感嵌入是编解码系统中多情感语音合成的一种简单方法。但是,如何更好地嵌入情感以及如何更有效地将情感注入TTS声学模型仍在研究中。在本文中,我们提出了一个创新的约束,以帮助VAE提取具有更好聚类凝聚力的情感嵌入。此外,将获得的情感嵌入作为查询,通过注意聚合所有编码层的潜在表示。此外,来自编码器层本身的查询也很有用。实验证明,所提出的方法可以增强综合句法和语义信息的编码,产生更具表现力的情感语音。 摘要:Learning emotion embedding from reference audio is a straightforward approach for multi-emotion speech synthesis in encoder-decoder systems. But how to get better emotion embedding and how to inject it into TTS acoustic model more effectively are still under investigation. In this paper, we propose an innovative constraint to help VAE extract emotion embedding with better cluster cohesion. Besides, the obtained emotion embedding is used as query to aggregate latent representations of all encoder layers via attention. Moreover, the queries from encoder layers themselves are also helpful. Experiments prove the proposed methods can enhance the encoding of comprehensive syntactic and semantic information and produce more expressive emotional speech.
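针对“以情感嵌入为query、对文本编码器各层表示做注意力聚合”这一点,下面给出一个最小示意;层摘要取时间均值、缩放点积打分等细节均为假设,仅用于说明聚合机制本身。

```python
# 假设性示意:情感嵌入作为query,对各编码层输出计算层级注意力权重并加权求和
import torch

def aggregate_layers(layer_outputs, emotion_emb, proj):
    """layer_outputs: L个(B, T, H)张量的列表; emotion_emb: (B, H); proj: H->H的线性层(假设)。"""
    stacked = torch.stack(layer_outputs, dim=1)                 # (B, L, T, H)
    keys = stacked.mean(dim=2)                                  # (B, L, H) 每层的全局摘要作为key
    scores = torch.einsum('bh,blh->bl', proj(emotion_emb), keys) / keys.size(-1) ** 0.5
    weights = torch.softmax(scores, dim=-1)                     # (B, L) 各层的注意力权重
    return torch.einsum('bl,blth->bth', weights, stacked)       # 加权求和得到聚合表示
```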

【6】 Rep Works in Speaker Verification 标题:重参数化(Rep)在说话人验证中有效 链接:https://arxiv.org/abs/2110.09720

作者:Yufeng Ma,Miao Zhao,Yiwei Ding,Yu Zheng,Min Liu,Minqiang Xu 机构: SpeakIn Technologies Co. Ltd., Fudan University 备注:submitted to ICASSP 2022 摘要:多分支卷积神经网络结构在说话人确认中引起了广泛的关注,因为多个并行分支的聚合可以显著提高说话人确认的性能。然而,由于模型参数的增加和额外的操作,这种设计在推理过程中效率不够。在本文中,我们提出了一种新的多分支网络结构RepSPKNet,它使用了重新参数化技术。利用这种技术,我们的主干模型包含一个有效的类似VGG的推理状态,而其训练状态是一个复杂的多分支结构。我们首先将RepVGG的具体结构引入到说话人验证中,并提出了这种结构的几种变体。在基于VoxCeleb的测试集上评估性能。我们证明了分支多样性和分支容量在RepSPKNet设计中都起着重要作用。我们的RepSPKNet在VoxCeleb1-H上的EER为1.5982%,minDCF为0.1374,达到了最先进的性能。 摘要:Multi-branch convolutional neural network architecture has raised lots of attention in speaker verification since the aggregation of multiple parallel branches can significantly improve performance. However, this design is not efficient enough during the inference time due to the increase of model parameters and extra operations. In this paper, we present a new multi-branch network architecture RepSPKNet that uses a re-parameterization technique. With this technique, our backbone model contains an efficient VGG-like inference state while its training state is a complicated multi-branch structure. We first introduce the specific structure of RepVGG into speaker verification and propose several variants of this structure. The performance is evaluated on VoxCeleb-based test sets. We demonstrate that both the branch diversity and the branch capacity play important roles in RepSPKNet designing. Our RepSPKNet achieves state-of-the-art performance with a 1.5982% EER and a 0.1374 minDCF on VoxCeleb1-H.
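RepSPKNet所依赖的RepVGG式重参数化,核心是把训练期的多分支(3x3卷积、1x1卷积、恒等)在推理期合并为单个3x3卷积。下面的示意演示了这一合并并附数值自检(为简洁省略了BN折叠);这只是重参数化原理的演示,并非论文网络本身。

```python
# 假设性示意:RepVGG式重参数化,把并行三分支合并为单个3x3卷积(省略BN折叠)
import torch
import torch.nn.functional as F

def merge_branches(w3, b3, w1, b1, channels):
    """w3:(C,C,3,3), w1:(C,C,1,1)。返回等效的单个3x3卷积权重与偏置。"""
    w1_pad = F.pad(w1, [1, 1, 1, 1])                 # 1x1核零填充成3x3
    w_id = torch.zeros_like(w3)
    for c in range(channels):                        # 恒等分支等价于中心为1的3x3核
        w_id[c, c, 1, 1] = 1.0
    return w3 + w1_pad + w_id, b3 + b1               # 卷积线性可加:三分支合并

# 数值自检:合并前后输出应一致
C = 4
x = torch.randn(1, C, 8, 8)
w3, b3 = torch.randn(C, C, 3, 3), torch.randn(C)
w1, b1 = torch.randn(C, C, 1, 1), torch.randn(C)
y_multi = F.conv2d(x, w3, b3, padding=1) + F.conv2d(x, w1, b1) + x
wm, bm = merge_branches(w3, b3, w1, b1, C)
y_single = F.conv2d(x, wm, bm, padding=1)
assert torch.allclose(y_multi, y_single, atol=1e-5)
```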

【7】 Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge 标题:神经词典阅读器:通过利用外部文本知识减少端到端TTS中的发音错误 链接:https://arxiv.org/abs/2110.09698

作者:Mutian He,Jingzhou Yang,Lei He,Frank K. Soong 机构:The Hong Kong University of Science and Technology, Microsoft China 备注:5 pages, 3 figures 摘要:端到端TTS的数据要求很高,因为昂贵的语音语料库很难覆盖所有必要的知识,神经模型也很难学习这些知识,因此需要手动注入额外的知识。例如,为了获取没有规则正字法的语言的发音知识,需要基于结构化的大型发音词典构建复杂的字形到音素管道,从而导致将神经TTS扩展到此类语言的额外成本,有时甚至很高。在本文中,我们提出了一个框架,利用Token2Knowledge注意模块学习从非结构化外部资源中提取知识。该框架被用于构建一个新的端到端TTS模型,名为神经词典阅读器(Neural Lexicon Reader),该模型从原始词典文本中提取发音。实验结果表明,该模型能够显著减少低资源端到端中文TTS中的发音错误,并且词典阅读能力可以用较少的数据量迁移到其他语言。 摘要:End-to-end TTS suffers from high data requirements as it is difficult for both costly speech corpora to cover all necessary knowledge and neural models to learn the knowledge, hence additional knowledge needs to be injected manually. For example, to capture pronunciation knowledge on languages without regular orthography, a complicated grapheme-to-phoneme pipeline needs to be built based on a structured, large pronunciation lexicon, leading to extra, sometimes high, costs to extend neural TTS to such languages. In this paper, we propose a framework to learn to extract knowledge from unstructured external resources using Token2Knowledge attention modules. The framework is applied to build a novel end-to-end TTS model named Neural Lexicon Reader that extracts pronunciations from raw lexicon texts. Experiments support the potential of our framework that the model significantly reduces pronunciation errors in low-resource, end-to-end Chinese TTS, and the lexicon-reading capability can be transferred to other languages with a smaller amount of data.
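下面以一个占位实现示意Token2Knowledge注意的基本形态:每个输入token对词典原文的编码序列做注意力并以残差方式注入;q_proj等线性层与残差连接方式均为假设,论文模块的具体设计可能更复杂。

```python
# 假设性示意:Token2Knowledge注意,让输入token从词典原文编码中检索发音相关信息
import torch

def token2knowledge(token_h, lexicon_h, q_proj, k_proj, v_proj):
    """token_h: (B, T, H) 输入token表示; lexicon_h: (B, S, H) 词典原文编码; *_proj为线性层(假设)。"""
    q, k, v = q_proj(token_h), k_proj(lexicon_h), v_proj(lexicon_h)
    attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)   # (B, T, S)
    return token_h + attn @ v        # 将检索到的"知识"残差式地注入token表示
```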

【8】 Neural Synthesis of Footsteps Sound Effects with Generative Adversarial Networks 标题:基于产生式对抗性网络的脚步声效果神经合成 链接:https://arxiv.org/abs/2110.09605

作者:Marco Comunità,Huy Phan,Joshua D. Reiss 机构:Centre for Digital Music, Queen Mary University of London, UK 摘要:脚步声是多媒体应用中最普遍的声音效果之一。围绕理解脚步声效的声学特征并为其开发合成模型,已有大量研究。在本文中,我们首次尝试采用神经合成来完成这项任务。我们实现了两种基于GAN的结构,并将结果与真实录音以及六种传统的声音合成方法进行了比较。我们的结构获得了与真实录音样本一样高的真实感评分,对该任务而言是令人鼓舞的结果。 摘要:Footsteps are among the most ubiquitous sound effects in multimedia applications. There is substantial research into understanding the acoustic features and developing synthesis models for footstep sound effects. In this paper, we present a first attempt at adopting neural synthesis for this task. We implemented two GAN-based architectures and compared the results with real recordings as well as six traditional sound synthesis methods. Our architectures reached realism scores as high as recorded samples, showing encouraging results for the task at hand.

【9】 Who calls the shots? Rethinking Few-Shot Learning for Audio 标题:谁说了算?关于音频Few-Shot学习的再思考 链接:https://arxiv.org/abs/2110.09600

作者:Yu Wang,Nicholas J. Bryan,Justin Salamon,Mark Cartwright,Juan Pablo Bello 机构: Music and Audio Research Laboratory, New York University, New York, NY, USA, Adobe Research, San Francisco, CA, USA, New Jersey Institute of Technology, Newark, NJ, USA 备注:WASPAA 2021 摘要:“Few-Shot学习”旨在训练模型,只要给出少量标记示例(称为支持集),就可以识别新类。虽然近年来该领域取得了显著的进展,但他们通常将重点放在多类图像分类上。相比之下,由于声音重叠,音频通常是多标签的,从而产生独特的特性,如复调和信噪比(SNR)。这导致了关于此类音频属性可能对少数镜头学习系统设计、性能和人机交互产生的影响的未回答问题,因为通常由用户收集和提供推理时间支持集示例。我们通过一系列旨在阐明这些问题答案的实验来解决这些问题。我们介绍了两个新的数据集,FSD-MIX-CLIPS和FSD-MIX-SED,它们的程序生成允许我们系统地探索这些问题。我们的实验得出了关于少数镜头学习的音频特定见解,其中一些与图像领域的最新发现不一致:没有最佳一刀切的模型、方法和支持集选择标准。相反,这取决于预期的应用程序场景。我们的代码和数据可在https://github.com/wangyu/rethink-audio-fsl. 摘要:Few-shot learning aims to train models that can recognize novel classes given just a handful of labeled examples, known as the support set. While the field has seen notable advances in recent years, they have often focused on multi-class image classification. Audio, in contrast, is often multi-label due to overlapping sounds, resulting in unique properties such as polyphony and signal-to-noise ratios (SNR). This leads to unanswered questions concerning the impact such audio properties may have on few-shot learning system design, performance, and human-computer interaction, as it is typically up to the user to collect and provide inference-time support set examples. We address these questions through a series of experiments designed to elucidate the answers to these questions. We introduce two novel datasets, FSD-MIX-CLIPS and FSD-MIX-SED, whose programmatic generation allows us to explore these questions systematically. Our experiments lead to audio-specific insights on few-shot learning, some of which are at odds with recent findings in the image domain: there is no best one-size-fits-all model, method, and support set selection criterion. Rather, it depends on the expected application scenario. Our code and data are available at https://github.com/wangyu/rethink-audio-fsl.
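作为理解支持集(support set)推理方式的参考,下面给出一个基于类原型的少样本分类骨架;这只是常见的原型网络式写法(且仅覆盖单标签多分类情形,未体现文中强调的多标签/复调与信噪比特性),与论文所比较的具体方法未必一致。

```python
# 假设性示意:用支持集的类原型做少样本音频分类推理(假设每个类别至少有一个支持样本)
import torch

def prototype_predict(embed, support_x, support_y, query_x, num_classes):
    """embed: 任意把音频特征映射为向量的模型(假设); support_y: (N,) 支持集标签。"""
    with torch.no_grad():
        s, q = embed(support_x), embed(query_x)                   # (N, D), (M, D)
        protos = torch.stack([s[support_y == c].mean(0)           # 每类支持样本的均值向量
                              for c in range(num_classes)])       # (C, D)
        dists = torch.cdist(q, protos)                            # (M, C) 欧氏距离
    return dists.argmin(dim=1)                                    # 最近原型即预测类别
```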

【10】 Adversarial Domain Adaptation with Paired Examples for Acoustic Scene Classification on Different Recording Devices 标题:基于配对实例的对抗性领域自适应在不同记录设备上的声场分类 链接:https://arxiv.org/abs/2110.09598

作者:Stanisław Kacprzak,Konrad Kowalczyk 机构:AGH University of Science and Technology, Institute of Electronics, -, Krakow, Poland 备注:Accepted for publication in the Proceedings of the 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 2021 摘要:在分类任务中,当数据收集在不同的领域时,分类精度会降低。为了解决这个问题,在本文中,我们研究了几种域自适应(DA)对抗模型及其对声学场景分类任务的影响。所研究的模型包括几种具有不同损耗函数的生成性对抗网络(GAN)和由两个互连GAN模型组成的所谓周期GAN。实验在DCASE20挑战任务1A数据集上进行,在该数据集中,我们可以利用使用不同设备记录的成对数据示例,即源域和目标域记录。进行的实验结果表明,使用周期GAN可以获得最佳性能的域适配,目标域器件的精度相对提高了66%,而源域器件的精度相对降低了6%。此外,通过使用配对数据示例,我们能够提高使用较大的未配对数据集训练的模型的总体精度,同时降低模型训练的计算成本。 摘要:In classification tasks, the classification accuracy diminishes when the data is gathered in different domains. To address this problem, in this paper, we investigate several adversarial models for domain adaptation (DA) and their effect on the acoustic scene classification task. The studied models include several types of generative adversarial networks (GAN), with different loss functions, and the so-called cycle GAN which consists of two interconnected GAN models. The experiments are performed on the DCASE20 challenge task 1A dataset, in which we can leverage the paired examples of data recorded using different devices, i.e., the source and target domain recordings. The results of performed experiments indicate that the best performing domain adaptation can be obtained using the cycle GAN, which achieves as much as 66% relative improvement in accuracy for the target domain device, while only 6% relative decrease in accuracy on the source domain. In addition, by utilizing the paired data examples, we are able to improve the overall accuracy over the model trained using larger unpaired data set, while decreasing the computational cost of the model training.
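循环GAN域自适应的关键约束是循环一致性损失(A域映射到B域再映射回A域应能还原),下面给出该损失的一个标准写法作为参考;G_ab/G_ba为占位的映射网络,权重lam为假设值,对抗损失部分从略。

```python
# 假设性示意:CycleGAN域自适应中的循环一致性损失(源域设备A与目标域设备B互相映射)
import torch

def cycle_consistency_loss(G_ab, G_ba, x_a, x_b, lam=10.0):
    """G_ab: A域谱特征映射到B域; G_ba: 反向映射; lam为假设的循环损失权重。"""
    rec_a = G_ba(G_ab(x_a))                       # A -> B -> A
    rec_b = G_ab(G_ba(x_b))                       # B -> A -> B
    return lam * (torch.nn.functional.l1_loss(rec_a, x_a) +
                  torch.nn.functional.l1_loss(rec_b, x_b))
```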

【11】 Chunked Autoregressive GAN for Conditional Waveform Synthesis 标题:用于条件波形合成的分块自回归GAN 链接:https://arxiv.org/abs/2110.10139

作者:Max Morrison,Rithesh Kumar,Kundan Kumar,Prem Seetharaman,Aaron Courville,Yoshua Bengio 机构:Northwestern University, Descript, Inc., Mila, Qu´ebec Artificial Intelligence Institute, Universit´e de Montr´eal, CIFAR Fellow, CIFAR Program Co-director 备注:Under review as a conference paper at ICLR 2022 摘要:条件波形合成模型学习给定条件下的音频波形分布,如文本、mel频谱图或MIDI。这些系统采用深度生成模型,通过顺序(自回归)或并行(非自回归)采样对波形进行建模。生成对抗网络(GAN)已成为非自回归波形合成的常用选择。然而,最先进的GAN模型在执行mel谱图反演时会产生伪影。在本文中,我们证明了这些伪影与生成器无法学习精确的基音和周期性相对应。我们表明,与使用自回归相比,简单的基音和周期调节不足以减少这种误差。我们讨论了自回归为学习瞬时频率和相位之间的关系提供的感应偏差,并表明,即使在每次向前传递期间对大块波形进行自回归采样时,这种感应偏差仍然有效。相对于现有的基于GAN的模型,我们提出的分块自回归GAN(CARGAN)模型将基音误差降低了40-60%,将训练时间减少了58%,保持了适合实时或交互式应用的快速生成速度,并保持或提高了主观质量。 摘要:Conditional waveform synthesis models learn a distribution of audio waveforms given conditioning such as text, mel-spectrograms, or MIDI. These systems employ deep generative models that model the waveform via either sequential (autoregressive) or parallel (non-autoregressive) sampling. Generative adversarial networks (GANs) have become a common choice for non-autoregressive waveform synthesis. However, state-of-the-art GAN-based models produce artifacts when performing mel-spectrogram inversion. In this paper, we demonstrate that these artifacts correspond with an inability for the generator to learn accurate pitch and periodicity. We show that simple pitch and periodicity conditioning is insufficient for reducing this error relative to using autoregression. We discuss the inductive bias that autoregression provides for learning the relationship between instantaneous frequency and phase, and show that this inductive bias holds even when autoregressively sampling large chunks of the waveform during each forward pass. Relative to prior state-of- the-art GAN-based models, our proposed model, Chunked Autoregressive GAN (CARGAN) reduces pitch error by 40-60%, reduces training time by 58%, maintains a fast generation speed suitable for real-time or interactive applications, and maintains or improves subjective quality.
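为说明“每次前向传递自回归地合成一大块波形”的分块生成方式,下面给出一个占位接口的示意;chunk_frames、ar_samples、hop等数值均为假设,generator的具体输入输出约定以论文为准。

```python
# 假设性示意:分块自回归生成,每块以上一块波形的末尾作为自回归条件
import torch

def generate_chunked(generator, mels, chunk_frames=64, ar_samples=512, hop=256):
    """generator(mel_chunk, prev_tail) -> 波形块; mels: (B, n_mel, T)。参数均为假设值。"""
    B = mels.size(0)
    prev_tail = torch.zeros(B, ar_samples)                 # 初始自回归上下文为静音
    chunks = []
    for start in range(0, mels.size(2), chunk_frames):
        mel_chunk = mels[:, :, start:start + chunk_frames]
        wav_chunk = generator(mel_chunk, prev_tail)        # 约定输出 (B, chunk_frames*hop)
        chunks.append(wav_chunk)
        prev_tail = wav_chunk[:, -ar_samples:]             # 下一块以本块末尾为条件
    return torch.cat(chunks, dim=1)
```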

【12】 Private Language Model Adaptation for Speech Recognition 标题:用于语音识别的私有语言模型自适应 链接:https://arxiv.org/abs/2110.10026

作者:Zhe Liu,Ke Li,Shreyan Bakshi,Fuchun Peng 机构:Facebook, Menlo Park, CA, USA 摘要:语音模型自适应对于处理服务器端代理训练数据与用户本地设备上接收的实际数据之间的差异至关重要。通过使用联邦学习(FL),我们介绍了一种在专用设备上持续自适应神经网络语言模型(NNLMs)的有效方法,并将其应用于自动语音识别(ASR)。为了解决设备上训练语料库中潜在的语音转录错误,我们进行了实证研究,比较了在FL设置中利用标记水平置信分数来提高NNLM质量的各种策略。实验表明,与无模型自适应方法相比,该方法在两个语音评价数据集上分别实现了相对2.6%和10.8%的字错误率降低。我们还提供分析,以评估我们提出的程序的隐私保障。 摘要:Speech model adaptation is crucial to handle the discrepancy between server-side proxy training data and actual data received on users' local devices. With the use of federated learning (FL), we introduce an efficient approach on continuously adapting neural network language models (NNLMs) on private devices with applications on automatic speech recognition (ASR). To address the potential speech transcription errors in the on-device training corpus, we perform empirical studies on comparing various strategies of leveraging token-level confidence scores to improve the NNLM quality in the FL settings. Experiments show that compared with no model adaptation, the proposed method achieves relative 2.6% and 10.8% word error rate (WER) reductions on two speech evaluation datasets, respectively. We also provide analysis in evaluating privacy guarantees of our presented procedure.
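下面示意两件事:用token级置信度对本地NNLM损失加权(降低疑似转录错误token的影响),以及按数据量加权的FedAvg聚合;这种加权方式只是文中所比较策略的一种假设性示例,并非论文最终方案。

```python
# 假设性示意:置信度加权的本地语言模型损失 + FedAvg式参数聚合
import copy
import torch

def local_confidence_weighted_loss(logits, targets, confidences):
    """logits: (N, V); targets: (N,); confidences: (N,) ASR对各token的置信分数。"""
    nll = torch.nn.functional.cross_entropy(logits, targets, reduction='none')
    return (confidences * nll).sum() / confidences.sum().clamp_min(1e-8)   # 低置信token权重更小

def fedavg(global_model, client_states, client_sizes):
    """按客户端数据量加权平均各客户端更新后的参数,并载入全局模型。"""
    total = float(sum(client_sizes))
    new_state = copy.deepcopy(client_states[0])
    for key in new_state:
        new_state[key] = sum(s[key] * (n / total)
                             for s, n in zip(client_states, client_sizes))
    global_model.load_state_dict(new_state)
```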

【13】 The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks 标题:鸡尾酒叉子问题:现实世界配乐的三条主干音频分离 链接:https://arxiv.org/abs/2110.09958

作者:Darius Petermann,Gordon Wichern,Zhong-Qiu Wang,Jonathan Le Roux 机构:Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, Indiana University, Department of Intelligent Systems Engineering, Bloomington, IN, USA 备注:Submitted to ICASSP2022. For resources and examples, see this https URL 摘要:鸡尾酒会问题的目的是在复杂的声学场景中隔离任何感兴趣的源,并且长期以来激发了音频源分离的研究。最近的工作主要集中在将语音与噪音、语音与语音、乐器与乐器或声音事件彼此分离。然而,将音频混合(例如电影配乐)分为语音、音乐和音效三大类(此处理解为包括环境噪声和自然声音事件)在很大程度上尚未得到探索,尽管有广泛的潜在应用。本文将此任务形式化为鸡尾酒叉问题,并提出了Divide-and-Remaster(DnR)数据集来促进此主题的研究。DnR由三个成熟的音频数据集(LibriVox、FMA、FSD50k)构建而成,在源重叠和相对响度方面注意再现与专业制作的内容相似的条件,并以CD质量提供。我们在DnR上测试标准声源分离算法,并进一步引入一种新的混合STFT分辨率模型,以更好地解决三种声源类型的各种声学特性。我们的最佳模型在音乐11.3 dB、语音11.8 dB和音效10.9 dB的混合情况下产生SI-SDR改进。 摘要:The cocktail party problem aims at isolating any source of interest within a complex acoustic scene, and has long inspired audio source separation research. Recent efforts have mainly focused on separating speech from noise, speech from speech, musical instruments from each other, or sound events from each other. However, separating an audio mixture (e.g., movie soundtrack) into the three broad categories of speech, music, and sound effects (here understood to include ambient noise and natural sound events) has been left largely unexplored, despite a wide range of potential applications. This paper formalizes this task as the cocktail fork problem, and presents the Divide and Remaster (DnR) dataset to foster research on this topic. DnR is built from three well-established audio datasets (LibriVox, FMA, FSD50k), taking care to reproduce conditions similar to professionally produced content in terms of source overlap and relative loudness, and made available at CD quality. We benchmark standard source separation algorithms on DnR, and further introduce a new mixed-STFT-resolution model to better address the variety of acoustic characteristics of the three source types. Our best model produces SI-SDR improvements over the mixture of 11.3 dB for music, 11.8 dB for speech, and 10.9 dB for sound effects.
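文中以SI-SDR改进量报告分离效果,下面给出SI-SDR的标准计算方式作为参考;改进量即“估计信号相对参考的SI-SDR”减去“原始混合相对参考的SI-SDR”。

```python
# 尺度不变信噪比(SI-SDR)的标准计算,用于衡量各stem相对混合信号的分离改进量
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """estimate/reference: 已对齐的一维波形。返回dB值。"""
    ref = reference - reference.mean()
    est = estimate - estimate.mean()
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)    # 对参考的最优缩放
    target = alpha * ref
    noise = est - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

# SI-SDR改进量 = si_sdr(分离结果, 纯净stem) - si_sdr(原始混合, 纯净stem)
```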

【14】 Speech Representation Learning Through Self-supervised Pretraining And Multi-task Finetuning 标题:基于自监督预训练和多任务精调的语音表征学习 链接:https://arxiv.org/abs/2110.09930

作者:Yi-Chen Chen,Shu-wen Yang,Cheng-Kuang Lee,Simon See,Hung-yi Lee 机构:National Taiwan University, Taiwan, NVIDIA AI Technology Center, NVIDIA 摘要:语音表征学习在语音处理中起着至关重要的作用。其中,自监督学习(SSL)已成为一个重要的研究方向。结果表明,SSL预训练模型可以在语音处理的各种下游任务中获得优异的性能。另一方面,有监督多任务学习(MTL)是另一种表征学习范式,在计算机视觉(CV)和自然语言处理(NLP)中已被证明是有效的。然而,在语音处理中,对于有监督的MTL训练的一般表征学习模型还没有系统的研究。在本文中,我们证明了MTL微调可以进一步改善SSL预训练。我们分析了有监督的MTL微调的可推广性,以检验通过MTL微调学习的语音表示是否可以推广到看不见的新任务。 摘要:Speech representation learning plays a vital role in speech processing. Among them, self-supervised learning (SSL) has become an important research direction. It has been shown that an SSL pretraining model can achieve excellent performance in various downstream tasks of speech processing. On the other hand, supervised multi-task learning (MTL) is another representation learning paradigm, which has been proven effective in computer vision (CV) and natural language processing (NLP). However, there is no systematic research on the general representation learning model trained by supervised MTL in speech processing. In this paper, we show that MTL finetuning can further improve SSL pretraining. We analyze the generalizability of supervised MTL finetuning to examine if the speech representation learned by MTL finetuning can generalize to unseen new tasks.
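下面是“共享SSL预训练编码器+多个有监督任务头”的多任务微调最小骨架;任务头均为分类头、各任务损失等权求和、时间平均池化等设定都是假设,仅用于说明MTL微调的结构。

```python
# 假设性示意:在SSL预训练编码器之上做多任务(MTL)有监督微调
import torch

def mtl_finetune_step(encoder, heads, optimizer, batch):
    """heads: {任务名: 分类头}; batch: {任务名: (输入特征, 标签)}。均为假设的接口。"""
    total = 0.0
    for task, (x, y) in batch.items():
        h = encoder(x).mean(dim=1)       # (B, T, H) -> (B, H) 简单时间平均池化
        total = total + torch.nn.functional.cross_entropy(heads[task](h), y)
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return float(total)
```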

【15】 CycleFlow: Purify Information Factors by Cycle Loss 标题:CycleFlow:通过周期损耗净化信息因素 链接:https://arxiv.org/abs/2110.09928

作者:Haoran Sun,Chen Chen,Lantian Li,Dong Wang 机构:Center for Speech and Language Technologies, BNRist, Tsinghua University, China, Department of Computer Science and Technology, Tsinghua University, China 备注:Submitted to ICASSP 2022 摘要:SpeechFlow是一种基于信息瓶颈(IB)的功能强大的因子分解模型,其有效性已被多个研究报告。然而,SpeechFlow的一个潜在问题是,如果IB通道设计得不好,那么产生的因素就无法很好地分解。在本研究中,我们提出了一个结合随机因子替代和循环损失的循环流模型来解决这个问题。对语音转换任务的实验表明,这种简单的技术可以有效地减少各个因素之间的互信息,并产生比基于IB的SpeechFlow更好的转换效果。CycleFlow也可以用作语音编辑的强大工具。我们通过情绪感知实验证明了这种用法。 摘要:SpeechFlow is a powerful factorization model based on information bottleneck (IB), and its effectiveness has been reported by several studies. A potential problem of SpeechFlow, however, is that if the IB channels are not well designed, the resultant factors cannot be well disentangled. In this study, we propose a CycleFlow model that combines random factor substitution and cycle loss to solve this problem. Experiments on voice conversion tasks demonstrate that this simple technique can effectively reduce mutual information among individual factors, and produce clearly better conversion than the IB-based SpeechFlow. CycleFlow can also be used as a powerful tool for speech editing. We demonstrate this usage by an emotion perception experiment.
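针对“随机因子替换+循环损失”的思路,下面给出一个占位示意:替换某一因子并重构后,再次编码应能还原各因子;把因子划分为内容/说话人/音高只是便于说明的假设,与SpeechFlow/CycleFlow的实际因子设置未必相同。

```python
# 假设性示意:随机因子替换后的循环损失,约束各信息因子在转换前后保持一致
import torch

def cycle_loss(encode, decode, x_a, x_b):
    """encode(x) -> (content, speaker, pitch); decode(c, s, p) -> 重构特征。接口与因子划分均为假设。"""
    c_a, s_a, p_a = encode(x_a)
    _, s_b, _ = encode(x_b)
    x_swap = decode(c_a, s_b, p_a)                  # 替换说话人因子得到转换样本
    c2, s2, p2 = encode(x_swap)                     # 对转换样本再次编码
    return (torch.nn.functional.l1_loss(c2, c_a) +
            torch.nn.functional.l1_loss(s2, s_b) +
            torch.nn.functional.l1_loss(p2, p_a))   # 循环损失:各因子应保持不变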

【16】 Speech Enhancement Based on Cyclegan with Noise-informed Training 标题:基于带噪声信息训练的Cyclegan语音增强算法 链接:https://arxiv.org/abs/2110.09924

作者:Wen-Yuan Ting,Syu-Siang Wang,Hsin-Li Chang,Borching Su,Yu Tsao 机构:Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan, Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan, Department of Electrical Engineering, National Central University, Taoyuan, Taiwan 摘要:语音增强(SE)方法可分为有监督和无监督两类。对于无监督SE,一个著名的周期一致性生成对抗网络(CycleGAN)模型(由两个生成器和两个鉴别器组成)被证明具有强大的非线性映射能力,从而实现了一种有前途的噪声抑制能力。然而,低效率的训练过程以及噪声和干净语音之间的知识不足可能会限制CycleGAN SE在运行时的增强性能。在这项研究中,我们提出了一种新的噪声感知训练CycleGAN方法,该方法将额外的输入合并到生成器和鉴别器中,以帮助CycleGAN学习噪声域和干净域之间更精确的语音信号转换。附加输入功能用作指示器,在CycleGAN训练阶段提供更多信息。实验结果表明,该方法可以改善CycleGAN-SE模型,同时获得更好的音质和更少的信号失真。 摘要:Speech enhancement (SE) approaches can be classified into supervised and unsupervised categories. For unsupervised SE, a well-known cycle-consistent generative adversarial network (CycleGAN) model, which comprises two generators and two discriminators, has been shown to provide a powerful nonlinear mapping ability and thus achieve a promising noise-suppression capability. However, a low-efficiency training process along with insufficient knowledge between noisy and clean speech may limit the enhancement performance of the CycleGAN SE at runtime. In this study, we propose a novel noise-informed-training CycleGAN approach that incorporates additional inputs into the generators and discriminators to assist the CycleGAN in learning a more accurate transformation of speech signals between the noise and clean domains. The additional input feature serves as an indicator that provides more information during the CycleGAN training stage. Experiment results confirm that the proposed approach can improve the CycleGAN SE model while achieving a better sound quality and fewer signal distortions.
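关于“把附加的噪声信息作为指示特征送入生成器与判别器”,下面给出一种最直接的特征拼接方式作为示意;指示特征取噪声类型one-hot或噪声嵌入均为假设,论文的具体输入形式请以原文为准。

```python
# 假设性示意:把噪声指示特征沿特征维拼接到语音谱输入上
import torch

def with_noise_indicator(spec, noise_indicator):
    """spec: (B, F, T) 语音谱特征; noise_indicator: (B, K) 噪声类型one-hot或噪声嵌入。
    返回 (B, F+K, T),作为生成器/判别器的带噪声信息输入。"""
    tiled = noise_indicator.unsqueeze(-1).expand(-1, -1, spec.size(-1))   # (B, K, T)
    return torch.cat([spec, tiled], dim=1)
```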

【17】 Speech Enhancement-assisted Stargan Voice Conversion in Noisy Environments 标题:噪声环境下语音增强辅助的Stargan语音转换 链接:https://arxiv.org/abs/2110.09923

作者:Yun-Ju Chan,Chiang-Jen Peng,Syu-Siang Wang,Hsin-Min Wang,Yu Tsao,Tai-Shih Chi 机构:Department of Electrical and Computer Engineering, National Yang Ming Chiao Tung University, R.O.C, Department of Electrical Engineering, Yuan Ze University, R.O.C, Institute of Information Science, Academia Sinica, R.O.C 摘要:许多语音转换(VC)技术被提出用于不同说话人之间的语音转换。虽然在干净的环境中使用VC可以观察到转换语音的良好质量,但当系统在噪声条件下运行时,转换语音的质量会急剧下降。为了解决这个问题,我们提出了一种新的基于增强的StarGAN(E-StarGAN)VC系统,该系统利用语音增强(SE)技术进行信号预处理。SE系统通常用于减少噪声语音中的噪声成分,并为下游应用任务生成增强语音。因此,我们研究了结合VC和SE的E-StarGAN的有效性,并证明了该方法在各种噪声环境下的鲁棒性。在普通话数据集上进行的VC实验结果表明,当与SE结合时,所提出的E-StarGAN VC模型对未知噪声具有鲁棒性。此外,主观听力测试结果表明,所提出的E-StarGAN模型可以改善由噪声污染源话语转换而来的语音信号的音质。 摘要:Numerous voice conversion (VC) techniques have been proposed for the conversion of voices among different speakers. Although the decent quality of converted speech can be observed when VC is applied in a clean environment, the quality will drop sharply when the system is running under noisy conditions. In order to address this issue, we propose a novel enhancement-based StarGAN (E-StarGAN) VC system, which leverages a speech enhancement (SE) technique for signal pre-processing. SE systems are generally used to reduce noise components in noisy speech and to generate enhanced speech for downstream application tasks. Therefore, we investigated the effectiveness of E-StarGAN, which combines VC and SE, and demonstrated the robustness of the proposed approach in various noisy environments. The results of VC experiments conducted on a Mandarin dataset show that when combined with SE, the proposed E-StarGAN VC model is robust to unseen noises. In addition, the subjective listening test results show that the proposed E-StarGAN model can improve the sound quality of speech signals converted from noise-corrupted source utterances.

【18】 Multi-Modal Pre-Training for Automated Speech Recognition 标题:自动语音识别中的多模态预训练 链接:https://arxiv.org/abs/2110.09890

作者:David M. Chan,Shalini Ghosh,Debmalya Chakrabarty,Björn Hoffmeister 机构:⋆ University of California, Berkeley (EECS), † Amazon Alexa AI 摘要:传统上,自动语音识别的研究主要集中在音频表征的局部优先编码,以预测话语中的语音音素。不幸的是,依赖这种超局部信息的方法往往容易受到局部级损坏(如音频帧丢失或噪音)和全局级噪音(如环境噪音或背景噪音)的影响,而这些噪音在训练过程中是看不到的。在这项工作中,我们介绍了一种新的方法,它利用一种基于掩蔽语言建模的自监督学习技术来计算话语发生环境的全局、多模态编码。然后,我们使用一个新的深度融合框架将这种全局上下文集成到传统的ASR方法中,并证明了由此产生的方法在Librispeech上可以比基线方法高出7%;内部数据集的收益范围从6%(大型模型)到45%(小型模型)。 摘要:Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to be vulnerable to both local-level corruption (such as audio-frame drops, or loud noises) and global-level noise (such as environmental noise, or background noise) that has not been seen during training. In this work, we introduce a novel approach which leverages a self-supervised learning technique based on masked language modeling to compute a global, multi-modal encoding of the environment in which the utterance occurs. We then use a new deep-fusion framework to integrate this global context into a traditional ASR method, and demonstrate that the resulting method can outperform baseline methods by up to 7% on Librispeech; gains on internal datasets range from 6% (on larger models) to 45% (on smaller models).
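下面以一个门控残差的小模块示意“把整段话语环境的全局多模态编码融合进逐帧声学表示”的深度融合思路;该结构只是便于理解的假设性写法,并非论文的融合框架本身。

```python
# 假设性示意:门控残差式地把全局上下文编码注入逐帧声学特征
import torch
import torch.nn as nn

class SimpleDeepFusion(nn.Module):
    def __init__(self, acoustic_dim, context_dim):
        super().__init__()
        self.proj = nn.Linear(context_dim, acoustic_dim)
        self.gate = nn.Linear(acoustic_dim * 2, acoustic_dim)

    def forward(self, acoustic, global_context):
        """acoustic: (B, T, D); global_context: (B, C) 整段话语环境的全局多模态编码。"""
        ctx = self.proj(global_context).unsqueeze(1).expand_as(acoustic)  # 广播到每一帧
        g = torch.sigmoid(self.gate(torch.cat([acoustic, ctx], dim=-1)))  # 逐帧门控
        return acoustic + g * ctx                                          # 残差式注入全局上下文
```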

【19】 Personalized Speech Enhancement: New Models and Comprehensive Evaluation 标题:个性化语音增强:新模型与综合评价 链接:https://arxiv.org/abs/2110.09625

作者:Sefik Emre Eskimez,Takuya Yoshioka,Huaming Wang,Xiaofei Wang,Zhuo Chen,Xuedong Huang 机构:Microsoft, One Microsoft Way, Redmond, WA, USA 摘要:个性化语音增强(PSE)模型利用额外的线索,如d向量等说话人嵌入,实时去除背景噪声和干扰语音,从而改善各种声学场景下在线视频会议系统的语音质量。在这项工作中,我们提出了两种用于PSE的神经网络,其性能优于先前提出的语音滤波器。此外,我们还创建了测试集,用于捕获用户在视频会议期间可能遇到的各种场景。此外,我们提出了一个新的指标来衡量目标说话人过度抑制(TSOS)问题,尽管它在部署中至关重要,但之前没有得到足够的研究。此外,我们还提出了语音识别后端的多任务训练。我们的结果表明,与基线模型相比,所提出的模型可以获得更好的语音识别精度、语音清晰度和感知质量,并且多任务训练除了可以提高语音识别精度外,还可以缓解TSOS问题。 摘要:Personalized speech enhancement (PSE) models utilize additional cues, such as speaker embeddings like d-vectors, to remove background noise and interfering speech in real-time and thus improve the speech quality of online video conferencing systems for various acoustic scenarios. In this work, we propose two neural networks for PSE that achieve superior performance to the previously proposed VoiceFilter. In addition, we create test sets that capture a variety of scenarios that users can encounter during video conferencing. Furthermore, we propose a new metric to measure the target speaker over-suppression (TSOS) problem, which was not sufficiently investigated before despite its critical importance in deployment. Besides, we propose multi-task training with a speech recognition back-end. Our results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models, and the multi-task training can alleviate the TSOS issue in addition to improving the speech recognition accuracy.
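个性化语音增强的基本做法是把目标说话人的d-vector作为条件与带噪特征一起送入增强网络,下面给出一个mask式的占位示意;enhancer的结构与mask形式均为假设,文中的TSOS指标与多任务训练未在此体现。

```python
# 假设性示意:以说话人d-vector为条件的mask式个性化语音增强前向过程
import torch

def pse_forward(enhancer, noisy_feats, d_vector):
    """noisy_feats: (B, T, F); d_vector: (B, D) 目标说话人嵌入; enhancer为任意序列模型(假设输出(B, T, F))。"""
    cond = d_vector.unsqueeze(1).expand(-1, noisy_feats.size(1), -1)   # 广播到每一帧
    mask = torch.sigmoid(enhancer(torch.cat([noisy_feats, cond], dim=-1)))
    return mask * noisy_feats                                          # 以mask方式保留目标说话人能量
```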

3.eess.AS音频处理:

【1】 Chunked Autoregressive GAN for Conditional Waveform Synthesis 标题:用于条件波形合成的分块自回归GAN 链接:https://arxiv.org/abs/2110.10139

作者:Max Morrison,Rithesh Kumar,Kundan Kumar,Prem Seetharaman,Aaron Courville,Yoshua Bengio 机构:Northwestern University, Descript, Inc., Mila, Qu´ebec Artificial Intelligence Institute, Universit´e de Montr´eal, CIFAR Fellow, CIFAR Program Co-director 备注:Under review as a conference paper at ICLR 2022 摘要:条件波形合成模型学习给定条件下的音频波形分布,如文本、mel频谱图或MIDI。这些系统采用深度生成模型,通过顺序(自回归)或并行(非自回归)采样对波形进行建模。生成对抗网络(GAN)已成为非自回归波形合成的常用选择。然而,最先进的GAN模型在执行mel谱图反演时会产生伪影。在本文中,我们证明了这些伪影与生成器无法学习精确的基音和周期性相对应。我们表明,与使用自回归相比,简单的基音和周期调节不足以减少这种误差。我们讨论了自回归为学习瞬时频率和相位之间的关系提供的感应偏差,并表明,即使在每次向前传递期间对大块波形进行自回归采样时,这种感应偏差仍然有效。相对于现有的基于GAN的模型,我们提出的分块自回归GAN(CARGAN)模型将基音误差降低了40-60%,将训练时间减少了58%,保持了适合实时或交互式应用的快速生成速度,并保持或提高了主观质量。 摘要:Conditional waveform synthesis models learn a distribution of audio waveforms given conditioning such as text, mel-spectrograms, or MIDI. These systems employ deep generative models that model the waveform via either sequential (autoregressive) or parallel (non-autoregressive) sampling. Generative adversarial networks (GANs) have become a common choice for non-autoregressive waveform synthesis. However, state-of-the-art GAN-based models produce artifacts when performing mel-spectrogram inversion. In this paper, we demonstrate that these artifacts correspond with an inability for the generator to learn accurate pitch and periodicity. We show that simple pitch and periodicity conditioning is insufficient for reducing this error relative to using autoregression. We discuss the inductive bias that autoregression provides for learning the relationship between instantaneous frequency and phase, and show that this inductive bias holds even when autoregressively sampling large chunks of the waveform during each forward pass. Relative to prior state-of- the-art GAN-based models, our proposed model, Chunked Autoregressive GAN (CARGAN) reduces pitch error by 40-60%, reduces training time by 58%, maintains a fast generation speed suitable for real-time or interactive applications, and maintains or improves subjective quality.

【2】 Private Language Model Adaptation for Speech Recognition 标题:用于语音识别的私有语言模型自适应 链接:https://arxiv.org/abs/2110.10026

作者:Zhe Liu,Ke Li,Shreyan Bakshi,Fuchun Peng 机构:Facebook, Menlo Park, CA, USA 摘要:语音模型自适应对于处理服务器端代理训练数据与用户本地设备上接收的实际数据之间的差异至关重要。通过使用联邦学习(FL),我们介绍了一种在专用设备上持续自适应神经网络语言模型(NNLMs)的有效方法,并将其应用于自动语音识别(ASR)。为了解决设备上训练语料库中潜在的语音转录错误,我们进行了实证研究,比较了在FL设置中利用标记水平置信分数来提高NNLM质量的各种策略。实验表明,与无模型自适应方法相比,该方法在两个语音评价数据集上分别实现了相对2.6%和10.8%的字错误率降低。我们还提供分析,以评估我们提出的程序的隐私保障。 摘要:Speech model adaptation is crucial to handle the discrepancy between server-side proxy training data and actual data received on users' local devices. With the use of federated learning (FL), we introduce an efficient approach on continuously adapting neural network language models (NNLMs) on private devices with applications on automatic speech recognition (ASR). To address the potential speech transcription errors in the on-device training corpus, we perform empirical studies on comparing various strategies of leveraging token-level confidence scores to improve the NNLM quality in the FL settings. Experiments show that compared with no model adaptation, the proposed method achieves relative 2.6% and 10.8% word error rate (WER) reductions on two speech evaluation datasets, respectively. We also provide analysis in evaluating privacy guarantees of our presented procedure.

【3】 The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks 标题:鸡尾酒叉子问题:现实世界配乐的三条主干音频分离 链接:https://arxiv.org/abs/2110.09958

作者:Darius Petermann,Gordon Wichern,Zhong-Qiu Wang,Jonathan Le Roux 机构:Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, Indiana University, Department of Intelligent Systems Engineering, Bloomington, IN, USA 备注:Submitted to ICASSP2022. For resources and examples, see this https URL 摘要:鸡尾酒会问题的目的是在复杂的声学场景中隔离任何感兴趣的源,并且长期以来激发了音频源分离的研究。最近的工作主要集中在将语音与噪音、语音与语音、乐器与乐器或声音事件彼此分离。然而,将音频混合(例如电影配乐)分为语音、音乐和音效三大类(此处理解为包括环境噪声和自然声音事件)在很大程度上尚未得到探索,尽管有广泛的潜在应用。本文将此任务形式化为鸡尾酒叉问题,并提出了Divide-and-Remaster(DnR)数据集来促进此主题的研究。DnR由三个成熟的音频数据集(LibriVox、FMA、FSD50k)构建而成,在源重叠和相对响度方面注意再现与专业制作的内容相似的条件,并以CD质量提供。我们在DnR上测试标准声源分离算法,并进一步引入一种新的混合STFT分辨率模型,以更好地解决三种声源类型的各种声学特性。我们的最佳模型在音乐11.3 dB、语音11.8 dB和音效10.9 dB的混合情况下产生SI-SDR改进。 摘要:The cocktail party problem aims at isolating any source of interest within a complex acoustic scene, and has long inspired audio source separation research. Recent efforts have mainly focused on separating speech from noise, speech from speech, musical instruments from each other, or sound events from each other. However, separating an audio mixture (e.g., movie soundtrack) into the three broad categories of speech, music, and sound effects (here understood to include ambient noise and natural sound events) has been left largely unexplored, despite a wide range of potential applications. This paper formalizes this task as the cocktail fork problem, and presents the Divide and Remaster (DnR) dataset to foster research on this topic. DnR is built from three well-established audio datasets (LibriVox, FMA, FSD50k), taking care to reproduce conditions similar to professionally produced content in terms of source overlap and relative loudness, and made available at CD quality. We benchmark standard source separation algorithms on DnR, and further introduce a new mixed-STFT-resolution model to better address the variety of acoustic characteristics of the three source types. Our best model produces SI-SDR improvements over the mixture of 11.3 dB for music, 11.8 dB for speech, and 10.9 dB for sound effects.

【4】 Speech Representation Learning Through Self-supervised Pretraining And Multi-task Finetuning 标题:基于自监督预训练和多任务精调的语音表征学习 链接:https://arxiv.org/abs/2110.09930

作者:Yi-Chen Chen,Shu-wen Yang,Cheng-Kuang Lee,Simon See,Hung-yi Lee 机构:National Taiwan University, Taiwan, NVIDIA AI Technology Center, NVIDIA 摘要:语音表征学习在语音处理中起着至关重要的作用。其中,自监督学习(SSL)已成为一个重要的研究方向。结果表明,SSL预训练模型可以在语音处理的各种下游任务中获得优异的性能。另一方面,有监督多任务学习(MTL)是另一种表征学习范式,在计算机视觉(CV)和自然语言处理(NLP)中已被证明是有效的。然而,在语音处理中,对于有监督的MTL训练的一般表征学习模型还没有系统的研究。在本文中,我们证明了MTL微调可以进一步改善SSL预训练。我们分析了有监督的MTL微调的可推广性,以检验通过MTL微调学习的语音表示是否可以推广到看不见的新任务。 摘要:Speech representation learning plays a vital role in speech processing. Among them, self-supervised learning (SSL) has become an important research direction. It has been shown that an SSL pretraining model can achieve excellent performance in various downstream tasks of speech processing. On the other hand, supervised multi-task learning (MTL) is another representation learning paradigm, which has been proven effective in computer vision (CV) and natural language processing (NLP). However, there is no systematic research on the general representation learning model trained by supervised MTL in speech processing. In this paper, we show that MTL finetuning can further improve SSL pretraining. We analyze the generalizability of supervised MTL finetuning to examine if the speech representation learned by MTL finetuning can generalize to unseen new tasks.

【5】 CycleFlow: Purify Information Factors by Cycle Loss 标题:CycleFlow:通过周期损耗净化信息因素 链接:https://arxiv.org/abs/2110.09928

作者:Haoran Sun,Chen Chen,Lantian Li,Dong Wang 机构:Center for Speech and Language Technologies, BNRist, Tsinghua University, China, Department of Computer Science and Technology, Tsinghua University, China 备注:Submitted to ICASSP 2022 摘要:SpeechFlow是一种基于信息瓶颈(IB)的功能强大的因子分解模型,其有效性已被多个研究报告。然而,SpeechFlow的一个潜在问题是,如果IB通道设计得不好,那么产生的因素就无法很好地分解。在本研究中,我们提出了一个结合随机因子替代和循环损失的循环流模型来解决这个问题。对语音转换任务的实验表明,这种简单的技术可以有效地减少各个因素之间的互信息,并产生比基于IB的SpeechFlow更好的转换效果。CycleFlow也可以用作语音编辑的强大工具。我们通过情绪感知实验证明了这种用法。 摘要:SpeechFlow is a powerful factorization model based on information bottleneck (IB), and its effectiveness has been reported by several studies. A potential problem of SpeechFlow, however, is that if the IB channels are not well designed, the resultant factors cannot be well disentangled. In this study, we propose a CycleFlow model that combines random factor substitution and cycle loss to solve this problem. Experiments on voice conversion tasks demonstrate that this simple technique can effectively reduce mutual information among individual factors, and produce clearly better conversion than the IB-based SpeechFlow. CycleFlow can also be used as a powerful tool for speech editing. We demonstrate this usage by an emotion perception experiment.

【6】 Speech Enhancement Based on Cyclegan with Noise-informed Training 标题:基于带噪声信息训练的Cyclegan语音增强算法 链接:https://arxiv.org/abs/2110.09924

作者:Wen-Yuan Ting,Syu-Siang Wang,Hsin-Li Chang,Borching Su,Yu Tsao 机构:Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan, Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan, Department of Electrical Engineering, National Central University, Taoyuan, Taiwan 摘要:语音增强(SE)方法可分为有监督和无监督两类。对于无监督SE,一个著名的周期一致性生成对抗网络(CycleGAN)模型(由两个生成器和两个鉴别器组成)被证明具有强大的非线性映射能力,从而实现了一种有前途的噪声抑制能力。然而,低效率的训练过程以及噪声和干净语音之间的知识不足可能会限制CycleGAN SE在运行时的增强性能。在这项研究中,我们提出了一种新的噪声感知训练CycleGAN方法,该方法将额外的输入合并到生成器和鉴别器中,以帮助CycleGAN学习噪声域和干净域之间更精确的语音信号转换。附加输入功能用作指示器,在CycleGAN训练阶段提供更多信息。实验结果表明,该方法可以改善CycleGAN-SE模型,同时获得更好的音质和更少的信号失真。 摘要:Speech enhancement (SE) approaches can be classified into supervised and unsupervised categories. For unsupervised SE, a well-known cycle-consistent generative adversarial network (CycleGAN) model, which comprises two generators and two discriminators, has been shown to provide a powerful nonlinear mapping ability and thus achieve a promising noise-suppression capability. However, a low-efficiency training process along with insufficient knowledge between noisy and clean speech may limit the enhancement performance of the CycleGAN SE at runtime. In this study, we propose a novel noise-informed-training CycleGAN approach that incorporates additional inputs into the generators and discriminators to assist the CycleGAN in learning a more accurate transformation of speech signals between the noise and clean domains. The additional input feature serves as an indicator that provides more information during the CycleGAN training stage. Experiment results confirm that the proposed approach can improve the CycleGAN SE model while achieving a better sound quality and fewer signal distortions.

【7】 Speech Enhancement-assisted Stargan Voice Conversion in Noisy Environments 标题:噪声环境下语音增强辅助的Stargan语音转换 链接:https://arxiv.org/abs/2110.09923

作者:Yun-Ju Chan,Chiang-Jen Peng,Syu-Siang Wang,Hsin-Min Wang,Yu Tsao,Tai-Shih Chi 机构:Department of Electrical and Computer Engineering, National Yang Ming Chiao Tung University, R.O.C, Department of Electrical Engineering, Yuan Ze University, R.O.C, Institute of Information Science, Academia Sinica, R.O.C 摘要:许多语音转换(VC)技术被提出用于不同说话人之间的语音转换。虽然在干净的环境中使用VC可以观察到转换语音的良好质量,但当系统在噪声条件下运行时,转换语音的质量会急剧下降。为了解决这个问题,我们提出了一种新的基于增强的StarGAN(E-StarGAN)VC系统,该系统利用语音增强(SE)技术进行信号预处理。SE系统通常用于减少噪声语音中的噪声成分,并为下游应用任务生成增强语音。因此,我们研究了结合VC和SE的E-StarGAN的有效性,并证明了该方法在各种噪声环境下的鲁棒性。在普通话数据集上进行的VC实验结果表明,当与SE结合时,所提出的E-StarGAN VC模型对未知噪声具有鲁棒性。此外,主观听力测试结果表明,所提出的E-StarGAN模型可以改善由噪声污染源话语转换而来的语音信号的音质。 摘要:Numerous voice conversion (VC) techniques have been proposed for the conversion of voices among different speakers. Although the decent quality of converted speech can be observed when VC is applied in a clean environment, the quality will drop sharply when the system is running under noisy conditions. In order to address this issue, we propose a novel enhancement-based StarGAN (E-StarGAN) VC system, which leverages a speech enhancement (SE) technique for signal pre-processing. SE systems are generally used to reduce noise components in noisy speech and to generate enhanced speech for downstream application tasks. Therefore, we investigated the effectiveness of E-StarGAN, which combines VC and SE, and demonstrated the robustness of the proposed approach in various noisy environments. The results of VC experiments conducted on a Mandarin dataset show that when combined with SE, the proposed E-StarGAN VC model is robust to unseen noises. In addition, the subjective listening test results show that the proposed E-StarGAN model can improve the sound quality of speech signals converted from noise-corrupted source utterances.

【8】 Multi-Modal Pre-Training for Automated Speech Recognition 标题:自动语音识别中的多模态预训练 链接:https://arxiv.org/abs/2110.09890

作者:David M. Chan,Shalini Ghosh,Debmalya Chakrabarty,Björn Hoffmeister 机构:⋆ University of California, Berkeley (EECS), † Amazon Alexa AI 摘要:传统上,自动语音识别的研究主要集中在音频表征的局部优先编码,以预测话语中的语音音素。不幸的是,依赖这种超局部信息的方法往往容易受到局部级损坏(如音频帧丢失或噪音)和全局级噪音(如环境噪音或背景噪音)的影响,而这些噪音在训练过程中是看不到的。在这项工作中,我们介绍了一种新的方法,它利用一种基于掩蔽语言建模的自监督学习技术来计算话语发生环境的全局、多模态编码。然后,我们使用一个新的深度融合框架将这种全局上下文集成到传统的ASR方法中,并证明了由此产生的方法在Librispeech上可以比基线方法高出7%;内部数据集的收益范围从6%(大型模型)到45%(小型模型)。 摘要:Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to be vulnerable to both local-level corruption (such as audio-frame drops, or loud noises) and global-level noise (such as environmental noise, or background noise) that has not been seen during training. In this work, we introduce a novel approach which leverages a self-supervised learning technique based on masked language modeling to compute a global, multi-modal encoding of the environment in which the utterance occurs. We then use a new deep-fusion framework to integrate this global context into a traditional ASR method, and demonstrate that the resulting method can outperform baseline methods by up to 7% on Librispeech; gains on internal datasets range from 6% (on larger models) to 45% (on smaller models).

【9】 Personalized Speech Enhancement: New Models and Comprehensive Evaluation 标题:个性化语音增强:新模型与综合评价 链接:https://arxiv.org/abs/2110.09625

作者:Sefik Emre Eskimez,Takuya Yoshioka,Huaming Wang,Xiaofei Wang,Zhuo Chen,Xuedong Huang 机构:Microsoft, One Microsoft Way, Redmond, WA, USA 摘要:个性化语音增强(PSE)模型利用额外的线索,如d向量等说话人嵌入,实时去除背景噪声和干扰语音,从而改善各种声学场景下在线视频会议系统的语音质量。在这项工作中,我们提出了两种用于PSE的神经网络,其性能优于先前提出的语音滤波器。此外,我们还创建了测试集,用于捕获用户在视频会议期间可能遇到的各种场景。此外,我们提出了一个新的指标来衡量目标说话人过度抑制(TSOS)问题,尽管它在部署中至关重要,但之前没有得到足够的研究。此外,我们还提出了语音识别后端的多任务训练。我们的结果表明,与基线模型相比,所提出的模型可以获得更好的语音识别精度、语音清晰度和感知质量,并且多任务训练除了可以提高语音识别精度外,还可以缓解TSOS问题。 摘要:Personalized speech enhancement (PSE) models utilize additional cues, such as speaker embeddings like d-vectors, to remove background noise and interfering speech in real-time and thus improve the speech quality of online video conferencing systems for various acoustic scenarios. In this work, we propose two neural networks for PSE that achieve superior performance to the previously proposed VoiceFilter. In addition, we create test sets that capture a variety of scenarios that users can encounter during video conferencing. Furthermore, we propose a new metric to measure the target speaker over-suppression (TSOS) problem, which was not sufficiently investigated before despite its critical importance in deployment. Besides, we propose multi-task training with a speech recognition back-end. Our results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models, and the multi-task training can alleviate the TSOS issue in addition to improving the speech recognition accuracy.

【10】 Continual self-training with bootstrapped remixing for speech enhancement 标题:用于语音增强的自举混音连续自我训练 链接:https://arxiv.org/abs/2110.10103

作者:Efthymios Tzinis,Yossi Adi,Vamsi K. Ithapu,Buye Xu,Anurag Kumar 机构:University of Illinois at Urbana-Champaign,Facebook AI Research,Facebook Reality Labs Research 备注:Submitted to ICASSP 2022 摘要:我们提出了一种简单新颖的语音增强自监督训练方法RemixIT。该方法基于连续自训练方案,克服了以往研究的局限性,包括对域内噪声分布的假设和对干净目标信号的访问。具体地说,分离教师模型在域外数据集上预先训练,并用于推断一批域内混合的估计目标信号。接下来,我们通过使用置换估计的清洁和噪声信号生成人工混合来引导混合过程。最后,在使用最新的学生模型定期更新教师权重的同时,使用排列的估计源作为目标训练学生模型。我们的实验表明,在多个语音增强任务下,RemixIT的性能优于以前几种最先进的自监督方法。此外,RemixIT为语音增强任务的半监督和非监督域自适应提供了一个无缝的替代方案,同时具有足够的通用性,可以应用于任何分离任务,并与任何分离模型相匹配。 摘要:We propose RemixIT, a simple and novel self-supervised training method for speech enhancement. The proposed method is based on a continuously self-training scheme that overcomes limitations from previous studies including assumptions for the in-domain noise distribution and having access to clean target signals. Specifically, a separation teacher model is pre-trained on an out-of-domain dataset and is used to infer estimated target signals for a batch of in-domain mixtures. Next, we bootstrap the mixing process by generating artificial mixtures using permuted estimated clean and noise signals. Finally, the student model is trained using the permuted estimated sources as targets while we periodically update teacher's weights using the latest student model. Our experiments show that RemixIT outperforms several previous state-of-the-art self-supervised methods under multiple speech enhancement tasks. Additionally, RemixIT provides a seamless alternative for semi-supervised and unsupervised domain adaptation for speech enhancement tasks, while being general enough to be applied to any separation task and paired with any separation model.

【11】 Temporal separation of whale vocalizations from background oceanic noise using a power calculation 标题:基于功率计算的鲸鱼发声与背景海洋噪声的时间分离 链接:https://arxiv.org/abs/2110.10010

作者:Jacques van Wyk,Jaco Versfeld,Johan du Preez 机构:Department of Electrical and Electronic Engineering, University of Stellenbosch, Stellenbosch, South Africa, ! 摘要:在许多情况下,分析声音信号以搜索鲸目动物发声的过程是一项非常艰巨的任务,需要许多复杂的计算、过多的数字处理技术以及对声音信号进行仔细检查以确定发声的位置。为了简化这一过程,本文借助于平稳高斯噪声信号的稳健功率计算和确定给定样本帧的均值和方差的递归方法,开发了一种计算效率高且抗噪声的方法,用于确定音频段是否包含潜在的鲸目动物呼叫。由此产生的探测器在包含南露脊鲸声音的录音上进行测试,并将其性能与现有的当代能量探测器进行比较。该检测器在中高信噪比下表现出良好的性能。该探测器易于实现,计算效率高,使用可靠,能够在嘈杂的水下环境中准确检测鲸鱼的叫声。 摘要:The process of analyzing audio signals in search of cetacean vocalizations is in many cases a very arduous task, requiring many complex computations, a plethora of digital processing techniques and the scrutinization of an audio signal with a fine comb to determine where the vocalizations are located. To ease this process, a computationally efficient and noise-resistant method for determining whether an audio segment contains a potential cetacean call is developed here with the help of a robust power calculation for stationary Gaussian noise signals and a recursive method for determining the mean and variance of a given sample frame. The resulting detector is tested on audio recordings containing Southern Right whale sounds and its performance compared to that of an existing contemporary energy detector. The detector exhibits good performance at moderate-to-high signal-to-noise ratio values. The detector succeeds in being easy to implement, computationally efficient to use and robust enough to accurately detect whale vocalizations in a noisy underwater environment.

【12】 Speech Pattern based Black-box Model Watermarking for Automatic Speech Recognition 标题:基于语音模式的自动语音识别黑盒模型水印 链接:https://arxiv.org/abs/2110.09814

作者:Haozhe Chen,Weiming Zhang,Kunlin Liu,Kejiang Chen,Han Fang,Nenghai Yu 机构:University of Science and Technology of China 备注:5 pages, 2 figures 摘要:模型水印技术作为一种有效的知识产权保护方法,已被广泛应用于各种深度神经网络(DNN),包括语音分类模型。然而,如何为自动语音识别(ASR)模型设计一个黑盒水印方案仍然是一个尚未解决的问题,这是保护部署在云服务器上的远程ASR应用程序编程接口(API)的重要需求。由于ASR模型的条件独立性假设和基于标签检测的规避攻击风险,用于语音分类模型的黑盒模型水印方案不能应用于ASR模型。在本文中,我们提出了第一个用于保护ASR模型IP的黑盒模型水印框架。具体来说,我们通过将模型拥有者的语音片段传播到整个输入音频上,并使用隐藏文本标记触发音频,从而通过语言隐藏隐藏作者信息来合成触发音频。在最先进的开源ASR系统DeepSpeech上的实验证明了所提出的水印方案的可行性,该方案对五种攻击都具有鲁棒性,并且对准确性影响很小。 摘要:As an effective method for intellectual property (IP) protection, model watermarking technology has been applied on a wide variety of deep neural networks (DNN), including speech classification models. However, how to design a black-box watermarking scheme for automatic speech recognition (ASR) models is still an unsolved problem, which is a significant demand for protecting remote ASR Application Programming Interface (API) deployed in cloud servers. Due to conditional independence assumption and label-detection-based evasion attack risk of ASR models, the black-box model watermarking scheme for speech classification models cannot apply to ASR models. In this paper, we propose the first black-box model watermarking framework for protecting the IP of ASR models. Specifically, we synthesize trigger audios by spreading the speech clips of model owners over the entire input audios and labeling the trigger audios with the stego texts, which hides the authorship information with linguistic steganography. Experiments on the state-of-the-art open-source ASR system DeepSpeech demonstrate the feasibility of the proposed watermarking scheme, which is robust against five kinds of attacks and has little impact on accuracy.

【13】 SSAST: Self-Supervised Audio Spectrogram Transformer 标题:SSAST:自监督音频频谱转换器 链接:https://arxiv.org/abs/2110.09784

作者:Yuan Gong,Cheng-I Jeff Lai,Yu-An Chung,James Glass 机构:MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 摘要:最近,纯粹基于自我注意的神经网络,如视觉变换器(ViT),在各种视觉任务上表现出优于卷积神经网络(CNN)构建的深度学习模型,从而将最初为语言处理开发的变换器的成功扩展到视觉领域。最近的一项研究表明,类似的方法也可以应用于音频领域。具体而言,音频频谱图转换器(AST)在各种音频分类基准上实现了最先进的结果。然而,与CNN相比,纯Transformer模型往往需要更多的训练数据,AST的成功依赖于有监督的预训练,这需要大量标记数据和复杂的训练管道,从而限制了AST的实际使用。本文主要研究音频和语音分类,旨在通过利用未标记数据的自监督学习来缓解AST的数据需求问题。具体地说,我们建议使用来自AudioSet和Librispeech的未标记音频,使用联合鉴别和生成掩蔽频谱图面片建模(MSPM)对AST模型进行预训练。我们在音频和语音分类任务(包括音频事件分类、关键词识别、情感识别和说话人识别)上评估我们的预训练模型。提出的自我监督框架显著提高了AST在所有任务上的性能,平均提高了60.9%,与有监督的预训练AST相比,结果相似甚至更好。据我们所知,它是音频和语音领域第一个基于补丁的自监督学习框架,也是第一个用于AST的自监督学习框架。 摘要:Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-art results on various audio classification benchmarks. However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining that requires a large amount of labeled data and a complex training pipeline, thus limiting the practical usage of AST. This paper focuses on audio and speech classification, and aims to alleviate the data requirement issues with the AST by leveraging self-supervised learning using unlabeled data. Specifically, we propose to pretrain the AST model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio from AudioSet and Librispeech. We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification. The proposed self-supervised framework significantly boosts AST performance on all tasks, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST. To the best of our knowledge, it is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST.

【14】 Improving Emotional Speech Synthesis by Using SUS-Constrained VAE and Text Encoder Aggregation 标题:基于SUS约束的VAE和文本编码器聚合改进情感语音合成 链接:https://arxiv.org/abs/2110.09780

作者:Fengyu Yang,Jian Luan,Yujun Wang 机构:Xiaomi Corporation, Beijing, China 备注:submitted to ICASSP2022 摘要:从参考音频中学习情感嵌入是编解码系统中多情感语音合成的一种简单方法。但是,如何更好地嵌入情感以及如何更有效地将情感注入TTS声学模型仍在研究中。在本文中,我们提出了一个创新的约束,以帮助VAE提取具有更好聚类凝聚力的情感嵌入。此外,将获得的情感嵌入作为查询,通过注意聚合所有编码层的潜在表示。此外,来自编码器层本身的查询也很有用。实验证明,所提出的方法可以增强综合句法和语义信息的编码,产生更具表现力的情感语音。 摘要:Learning emotion embedding from reference audio is a straightforward approach for multi-emotion speech synthesis in encoder-decoder systems. But how to get better emotion embedding and how to inject it into TTS acoustic model more effectively are still under investigation. In this paper, we propose an innovative constraint to help VAE extract emotion embedding with better cluster cohesion. Besides, the obtained emotion embedding is used as query to aggregate latent representations of all encoder layers via attention. Moreover, the queries from encoder layers themselves are also helpful. Experiments prove the proposed methods can enhance the encoding of comprehensive syntactic and semantic information and produce more expressive emotional speech.

【15】 Rep Works in Speaker Verification
Link: https://arxiv.org/abs/2110.09720

Authors: Yufeng Ma, Miao Zhao, Yiwei Ding, Yu Zheng, Min Liu, Minqiang Xu
Affiliations: SpeakIn Technologies Co. Ltd.; Fudan University
Note: Submitted to ICASSP 2022
Abstract: Multi-branch convolutional neural network architectures have attracted considerable attention in speaker verification, since the aggregation of multiple parallel branches can significantly improve performance. However, such designs are not efficient enough at inference time due to the increase in model parameters and extra operations. In this paper, we present RepSPKNet, a new multi-branch network architecture that uses a re-parameterization technique. With this technique, our backbone model has an efficient VGG-like inference state, while its training state is a complicated multi-branch structure. We first introduce the specific structure of RepVGG into speaker verification and propose several variants of this structure. Performance is evaluated on VoxCeleb-based test sets. We demonstrate that both branch diversity and branch capacity play important roles in the design of RepSPKNet. Our RepSPKNet achieves state-of-the-art performance with a 1.5982% EER and a 0.1374 minDCF on VoxCeleb1-H.
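The re-parameterization trick behind the "efficient VGG-like inference state" can be illustrated with a generic RepVGG-style block (not the RepSPKNet code): parallel 3x3, 1x1, and identity branches, each followed by batch normalization, are folded algebraically into a single 3x3 convolution after training. Channel sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv_weight, bn):
    """Fold BatchNorm statistics into convolution weight and bias."""
    std = (bn.running_var + bn.eps).sqrt()
    w = conv_weight * (bn.weight / std).reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * bn.weight / std
    return w, b

class RepBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn_id = nn.BatchNorm2d(channels)

    def forward(self, x):  # multi-branch training-time forward
        return F.relu(self.bn3(self.conv3(x)) + self.bn1(self.conv1(x)) + self.bn_id(x))

    @torch.no_grad()
    def reparameterize(self):
        """Return an equivalent single 3x3 convolution for the inference state."""
        c = self.conv3.out_channels
        w3, b3 = fuse_conv_bn(self.conv3.weight, self.bn3)
        w1, b1 = fuse_conv_bn(self.conv1.weight, self.bn1)
        w1 = F.pad(w1, [1, 1, 1, 1])                      # 1x1 kernel -> centered 3x3
        id_kernel = torch.zeros(c, c, 3, 3)
        id_kernel[torch.arange(c), torch.arange(c), 1, 1] = 1.0
        wi, bi = fuse_conv_bn(id_kernel, self.bn_id)
        fused = nn.Conv2d(c, c, 3, padding=1)
        fused.weight.data, fused.bias.data = w3 + w1 + wi, b3 + b1 + bi
        return fused

block = RepBlock().eval()
x = torch.randn(1, 64, 40, 100)              # e.g. a batch of spectrogram feature maps
fused = block.reparameterize()
print(torch.allclose(block(x), F.relu(fused(x)), atol=1e-5))  # True: identical function
```

The final check confirms that, in evaluation mode, the fused convolution computes exactly the same function as the multi-branch block, which is what makes the inference state cheap without changing the trained model.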

【16】 Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge
Link: https://arxiv.org/abs/2110.09698

Authors: Mutian He, Jingzhou Yang, Lei He, Frank K. Soong
Affiliations: The Hong Kong University of Science and Technology; Microsoft China
Note: 5 pages, 3 figures
Abstract: End-to-end TTS suffers from high data requirements, as it is difficult for costly speech corpora to cover all necessary knowledge and for neural models to learn that knowledge, so additional knowledge needs to be injected manually. For example, to capture pronunciation knowledge for languages without a regular orthography, a complicated grapheme-to-phoneme pipeline needs to be built on top of a structured, large pronunciation lexicon, leading to extra, sometimes high, costs when extending neural TTS to such languages. In this paper, we propose a framework that learns to extract knowledge from unstructured external resources using Token2Knowledge attention modules. The framework is applied to build a novel end-to-end TTS model named Neural Lexicon Reader, which extracts pronunciations from raw lexicon texts. Experiments support the potential of the framework: the model significantly reduces pronunciation errors in low-resource, end-to-end Chinese TTS, and the lexicon-reading capability can be transferred to other languages with a smaller amount of data.
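A hedged sketch of what a Token2Knowledge-style attention module could look like, under the assumption that each input text token attends over an encoded bank of raw lexicon text (the external knowledge) and receives a knowledge vector fused back into its representation. The retrieval granularity, fusion, and dimensions are assumptions and may differ from the paper.

```python
import torch
import torch.nn as nn

class Token2Knowledge(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, token_states, lexicon_states, lexicon_padding_mask=None):
        # token_states:   (batch, n_tokens, d_model)  hidden states of input characters
        # lexicon_states: (batch, n_lexicon, d_model) encoded raw lexicon text
        knowledge, _ = self.attn(token_states, lexicon_states, lexicon_states,
                                 key_padding_mask=lexicon_padding_mask)
        # residual fusion: token representation enriched with retrieved knowledge
        return self.norm(token_states + knowledge)

tokens = torch.randn(2, 12, 256)     # e.g. 12 input characters
lexicon = torch.randn(2, 200, 256)   # e.g. 200 encoded lexicon-entry tokens
fused = Token2Knowledge()(tokens, lexicon)
```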

【17】 Neural Synthesis of Footsteps Sound Effects with Generative Adversarial Networks
Link: https://arxiv.org/abs/2110.09605

Authors: Marco Comunità, Huy Phan, Joshua D. Reiss
Affiliation: Centre for Digital Music, Queen Mary University of London, UK
Abstract: Footsteps are among the most ubiquitous sound effects in multimedia applications. There is substantial research into understanding their acoustic features and developing synthesis models for footstep sound effects. In this paper, we present a first attempt at adopting neural synthesis for this task. We implemented two GAN-based architectures and compared the results with real recordings as well as six traditional sound synthesis methods. Our architectures reached realism scores as high as those of recorded samples, showing encouraging results for the task at hand.
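For readers unfamiliar with neural audio synthesis, the sketch below shows a generic 1-D convolutional GAN generator/discriminator pair of the kind that could be trained on short footstep recordings. It is not either of the paper's two architectures; all layer sizes and the latent shape are illustrative.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, 256, 25, stride=4, padding=11, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(256, 128, 25, stride=4, padding=11, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(128, 1, 25, stride=4, padding=11, output_padding=1),
            nn.Tanh(),                       # waveform samples in [-1, 1]
        )

    def forward(self, z):                    # z: (batch, latent_dim, 16) seed frames
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 128, 25, stride=4, padding=11), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 256, 25, stride=4, padding=11), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(256, 1),
        )

    def forward(self, audio):                # audio: (batch, 1, samples)
        return self.net(audio)

z = torch.randn(4, 100, 16)
fake = Generator()(z)                        # (4, 1, 1024) synthetic clips
score = Discriminator()(fake)                # (4, 1) real/fake logits
```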

【18】 Who calls the shots? Rethinking Few-Shot Learning for Audio
Link: https://arxiv.org/abs/2110.09600

Authors: Yu Wang, Nicholas J. Bryan, Justin Salamon, Mark Cartwright, Juan Pablo Bello
Affiliations: Music and Audio Research Laboratory, New York University, New York, NY, USA; Adobe Research, San Francisco, CA, USA; New Jersey Institute of Technology, Newark, NJ, USA
Note: WASPAA 2021
Abstract: Few-shot learning aims to train models that can recognize novel classes given just a handful of labeled examples, known as the support set. While the field has seen notable advances in recent years, most work has focused on multi-class image classification. Audio, in contrast, is often multi-label due to overlapping sounds, resulting in unique properties such as polyphony and varying signal-to-noise ratios (SNR). This leads to unanswered questions about the impact such audio properties may have on few-shot learning system design, performance, and human-computer interaction, as it is typically up to the user to collect and provide inference-time support set examples. We address these questions through a series of experiments designed to elucidate them. We introduce two novel datasets, FSD-MIX-CLIPS and FSD-MIX-SED, whose programmatic generation allows us to explore these questions systematically. Our experiments lead to audio-specific insights on few-shot learning, some of which are at odds with recent findings in the image domain: there is no best one-size-fits-all model, method, or support set selection criterion. Rather, it depends on the expected application scenario. Our code and data are available at https://github.com/wangyu/rethink-audio-fsl.
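To make the role of the user-provided support set concrete, here is a hedged sketch of multi-label few-shot inference with class prototypes: a prototype is averaged from each class's positive support examples, and query clips are scored per class independently (binary relevance) via a sigmoid over the negative distance. The embedding network, distance, and scoring are illustrative assumptions, not the paper's models.

```python
import torch

def fewshot_scores(embed, support, support_labels, queries, n_classes):
    """support: (n_support, ...) clips; support_labels: (n_support, n_classes) multi-hot."""
    s = embed(support)                                      # (n_support, d)
    q = embed(queries)                                       # (n_queries, d)
    protos = []
    for c in range(n_classes):                               # mean of positive support examples
        pos = s[support_labels[:, c] > 0]
        protos.append(pos.mean(dim=0))
    protos = torch.stack(protos)                             # (n_classes, d)
    dist = torch.cdist(q, protos)                            # Euclidean distance to each prototype
    return torch.sigmoid(-dist)                              # independent per-class presence scores

embed = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64 * 100, 128))
support = torch.randn(10, 64, 100)                           # 10 labeled support clips
labels = torch.zeros(10, 3)                                  # 3 classes, multi-label
labels[:4, 0] = 1; labels[3:7, 1] = 1; labels[6:, 2] = 1     # some clips carry two labels
queries = torch.randn(5, 64, 100)
scores = fewshot_scores(embed, support, labels, queries, n_classes=3)  # (5, 3)
```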

【19】 Adversarial Domain Adaptation with Paired Examples for Acoustic Scene Classification on Different Recording Devices
Link: https://arxiv.org/abs/2110.09598

Authors: Stanisław Kacprzak, Konrad Kowalczyk
Affiliation: AGH University of Science and Technology, Institute of Electronics, Krakow, Poland
Note: Accepted for publication in the Proceedings of the 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 2021
Abstract: In classification tasks, classification accuracy diminishes when the data are gathered in different domains. To address this problem, in this paper we investigate several adversarial models for domain adaptation (DA) and their effect on the acoustic scene classification task. The studied models include several types of generative adversarial networks (GAN) with different loss functions, as well as the so-called cycle GAN, which consists of two interconnected GAN models. The experiments are performed on the DCASE20 challenge task 1A dataset, in which we can leverage paired examples of data recorded using different devices, i.e., source- and target-domain recordings. The results indicate that the best-performing domain adaptation is obtained with the cycle GAN, which achieves as much as a 66% relative improvement in accuracy for the target-domain device, with only a 6% relative decrease in accuracy on the source domain. In addition, by utilizing the paired data examples, we are able to improve the overall accuracy over a model trained on a larger unpaired dataset, while decreasing the computational cost of model training.
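A hedged sketch of the cycle-GAN objective referenced above: one generator maps source-device features to the target-device domain, a second maps back, an adversarial loss pushes mapped features toward the target domain, and a cycle-consistency term keeps the round trip close to the input. With paired recordings available, a direct supervised mapping loss can also be added. Networks, feature shapes, and loss weights below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_mapper(dim=64):
    return nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, dim))

G_st, G_ts = make_mapper(), make_mapper()    # source->target and target->source generators
D_t = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))  # target-domain critic

x_src = torch.randn(8, 64)                   # features recorded on the source device
y_tgt = torch.randn(8, 64)                   # paired features from the target device

fake_tgt = G_st(x_src)
adv_loss = F.binary_cross_entropy_with_logits(D_t(fake_tgt), torch.ones(8, 1))
cycle_loss = F.l1_loss(G_ts(fake_tgt), x_src) + F.l1_loss(G_st(G_ts(y_tgt)), y_tgt)
paired_loss = F.l1_loss(fake_tgt, y_tgt)     # only possible because paired examples exist
total_g_loss = adv_loss + 10.0 * cycle_loss + paired_loss  # cycle weight 10 is a common default, not from the paper
```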

