q-fin (Quantitative Finance): 4 papers
cs.SD (Sound): 7 papers
eess.AS (Audio and Speech Processing): 9 papers
1. q-fin (Quantitative Finance):
【1】 Path Integral Method for Step Option Pricing  Link: https://arxiv.org/abs/2112.09534
Authors: Qi Chen, Chao Guo  Affiliations: School of Economics and Management, Langfang Normal University, Langfang, China; School of Science, Langfang Normal University, Langfang, China  Abstract: The path integral method in quantum mechanics offers a new way of thinking about barrier option pricing. For double step barrier options, the option evolution process is analogous to a particle moving in a finite symmetric square potential well. Using energy-level approximation formulas, an analytical expression for the option price can be acquired. Numerical results show that the up-and-out call double step barrier option price decreases with increasing discounting time and exercise price, but increases with increasing underlying price for $S_0 < B$. For a fixed underlying price, exercise price, or discounting time, the option price decreases with increasing height of the barrier $V$.
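As a rough companion to this entry (not the paper's path-integral formula), the sketch below prices an occupation-time-penalized double-barrier call by Monte Carlo under Black-Scholes dynamics. The payoff convention exp(-V * time outside the corridor), the corridor [B_lo, B_up], and all parameter values are assumptions made for illustration.

```python
# Hedged sketch: Monte Carlo baseline for a double "step" barrier call under
# Black-Scholes dynamics. The occupation-time penalty exp(-V * tau_out) is an
# assumed payoff convention (soft killing outside the corridor [B_lo, B_up]);
# the paper's closed-form path-integral price is not reproduced here.
import numpy as np

def step_double_barrier_call_mc(S0, K, r, sigma, T, B_lo, B_up, V,
                                n_paths=100_000, n_steps=500, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    S = np.full(n_paths, float(S0))
    tau_out = np.zeros(n_paths)          # occupation time outside the corridor
    for _ in range(n_steps):
        Z = rng.standard_normal(n_paths)
        S *= np.exp((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * Z)
        tau_out += dt * ((S < B_lo) | (S > B_up))
    payoff = np.exp(-V * tau_out) * np.maximum(S - K, 0.0)
    return np.exp(-r * T) * payoff.mean()

print(step_double_barrier_call_mc(S0=100, K=100, r=0.03, sigma=0.2,
                                  T=1.0, B_lo=80, B_up=120, V=5.0))
```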
【2】 Free-Riding for Future: Field Experimental Evidence of Strategic Substitutability in Climate Protest  Link: https://arxiv.org/abs/2112.09478
Authors: Johannes Jarke-Neuert, Grischa Perino, Henrike Schwickert  Comments: 21 pages, 6 pages appendix, 4 figures  Abstract: We test the hypothesis that protest participation decisions in an adult population of potential climate protesters are interdependent. Subjects (n=1,510) from the four largest German cities were recruited two weeks before the protest date. We measured participation (ex post) and beliefs about the other subjects' participation (ex ante) in an online survey, used a randomized informational intervention to induce exogenous variance in beliefs, and estimated the causal effect of a change in belief on the probability of participation using a control function approach. Participation decisions are found to be strategic substitutes: a one percentage-point increase in belief causes a 0.67 percentage-point decrease in the probability of participation for the average subject.
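For readers unfamiliar with the estimation strategy, the sketch below runs a two-stage control-function regression on synthetic data: a randomized information treatment instruments beliefs, and the first-stage residual enters the outcome equation. Variable names, the data-generating process, and the linear-probability second stage are illustrative assumptions, not the authors' specification.

```python
# Hedged sketch of a control-function estimate on synthetic data: a randomized
# information treatment (the instrument) shifts beliefs; the first-stage
# residual enters the outcome equation to absorb belief endogeneity.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1510
treat = rng.integers(0, 2, n)                            # randomized information treatment
u = rng.normal(size=n)                                   # unobserved protest propensity
belief = 40 + 8 * treat + 5 * u + rng.normal(size=n)     # belief, in percentage points
prob = np.clip(0.6 - 0.0067 * belief + 0.05 * u, 0, 1)   # strategic substitutes by construction
participate = rng.binomial(1, prob)

# First stage: belief on the instrument; keep the residual as the control function
first = sm.OLS(belief, sm.add_constant(treat)).fit()
resid = first.resid

# Second stage: participation on belief plus the control function (linear probability model)
X = sm.add_constant(np.column_stack([belief, resid]))
second = sm.OLS(participate, X).fit(cov_type="HC1")
print(second.params[1])   # estimated effect of a one-point belief increase on participation
```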
【3】 Discrete signature and its application to finance  Link: https://arxiv.org/abs/2112.09342
Authors: Takanori Adachi, Yusuke Naritomi  Comments: 13 pages  Abstract: Signatures, one of the key concepts of rough path theory, have recently gained prominence as a means to find appropriate feature sets in machine learning systems. In this paper, in order to compute signatures directly from discrete data without going through a transformation to continuous data, we introduce a discretized version of signatures, called "discrete signatures". We show that discrete signatures can represent the quadratic variation, which has high relevance in financial applications. We also introduce the concept of "weighted signatures" as a variant of discrete signatures. This concept is defined to reflect the fact that data closer to the current time is more important than older data, and is expected to be applied to time series analysis. As an application of these two concepts, we take up a stock market related problem and succeed in performing a good estimation with fewer data points than before.
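A minimal sketch of the iterated-sums idea behind a discrete signature, assuming the usual index conventions (which may differ from the paper's exact definition): level 1 collects increments, and the level-2 diagonal term recovers the realized quadratic variation mentioned in the abstract.

```python
# Hedged sketch of low-order iterated sums for a 1-D series, in the spirit of a
# "discrete signature": level 1 is the total increment, the off-diagonal level-2
# term is an iterated sum over ordered pairs, and the diagonal level-2 term is
# the realized quadratic variation.
import numpy as np

def discrete_signature_level2(x):
    dx = np.diff(np.asarray(x, dtype=float))
    s1 = dx.sum()                                        # level 1: total increment
    s11_strict = np.sum(np.cumsum(dx)[:-1] * dx[1:])     # sum_{i<j} dx_i dx_j
    qv = np.sum(dx**2)                                   # diagonal term: quadratic variation
    # consistency check: s1**2 == 2 * s11_strict + qv
    return {"S^(1)": s1, "S^(1,1) i<j": s11_strict, "S^(1,1) i=j (QV)": qv}

prices = [100.0, 101.5, 100.8, 102.2, 101.9]
print(discrete_signature_level2(prices))
```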
【4】 An adaptive splitting method for the Cox-Ingersoll-Ross process  Link: https://arxiv.org/abs/2112.09465
Authors: Cónall Kelly, Gabriel J. Lord  Comments: 30 pages, 3 figures, submitted to the IMA Journal of Numerical Analysis  Abstract: We propose a new splitting method for strong numerical solution of the Cox-Ingersoll-Ross model. For this method, applied over both deterministic and adaptive random meshes, we prove a uniform moment bound and strong error results of order $1/4$ in $L_1$ and $L_2$ for the parameter regime $\kappa\theta > \sigma^2$. Our scheme does not fall into the class analyzed in Hefter & Herzwurm (2018), where convergence of maximum order $1/4$ of a novel class of Milstein-based methods over the full range of parameter values is shown. Hence we present a separate convergence analysis before we extend the new method to cover all parameter values by introducing a 'soft zero' region (where the deterministic flow determines the approximation), giving a hybrid type method to deal with the reflecting boundary. From numerical simulations we observe a rate of order $1$ when $\kappa\theta > \sigma^2$, rather than $1/4$. Asymptotically, for large noise, we observe that the rates of convergence decrease similarly to those of other schemes, but the proposed method displays smaller error constants. Our results also serve as supporting numerical evidence that the conjecture of Hefter & Jentzen (2019) holds true for methods with non-uniform Wiener increments.
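For orientation, the sketch below implements a generic Lie splitting for the CIR process on a uniform mesh (exact drift flow plus an exact square-root diffusion sub-step); it is not the paper's adaptive scheme and omits the 'soft zero' region.

```python
# Hedged sketch of a Lie splitting for the CIR process
#   dX = kappa*(theta - X) dt + sigma*sqrt(X) dW
# on a uniform mesh. Sub-step A solves dX = (kappa*(theta - X) - sigma^2/4) dt exactly;
# sub-step B solves dX = (sigma^2/4) dt + sigma*sqrt(X) dW exactly via
# X -> (sqrt(X) + (sigma/2) dW)^2. Not the paper's adaptive scheme.
import numpy as np

def cir_splitting(x0, kappa, theta, sigma, T, n_steps, n_paths=10_000, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    theta_tilde = theta - sigma**2 / (4.0 * kappa)
    x = np.full(n_paths, float(x0))
    for _ in range(n_steps):
        # A: exact flow of the shifted mean-reverting drift
        x = theta_tilde + (x - theta_tilde) * np.exp(-kappa * dt)
        x = np.maximum(x, 0.0)          # guard: keep the square root well defined
        # B: exact solution of dX = (sigma^2/4) dt + sigma*sqrt(X) dW
        dW = np.sqrt(dt) * rng.standard_normal(n_paths)
        x = (np.sqrt(x) + 0.5 * sigma * dW) ** 2
    return x

# parameters in the Feller regime kappa*theta > sigma^2 from the abstract
paths = cir_splitting(x0=0.05, kappa=2.0, theta=0.05, sigma=0.2, T=1.0, n_steps=200)
print(paths.mean())   # should stay close to theta, since x0 = theta
```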
2. cs.SD (Sound):
【1】 Linguistic and Gender Variation in Speech Emotion Recognition using Spectral Features  Link: https://arxiv.org/abs/2112.09596
Authors: Zachary Dair, Ryan Donovan, Ruairi O'Reilly  Affiliations: Munster Technological University, Cork, Ireland, www.mtu.ie  Comments: Preprint for the AICS 2021 Conference (Machine Learning for Time Series section). This publication has emanated from research supported in part by a grant from Science Foundation Ireland under grant number 18/CRT/6222. 12 pages, 5 figures  Abstract: This work explores the effect of gender and linguistic-based vocal variations on the accuracy of emotive expression classification. Emotive expressions are considered from the perspective of spectral features in speech (Mel-frequency cepstral coefficients, Mel-spectrogram, spectral contrast). Emotions are considered from the perspective of Basic Emotion Theory. A convolutional neural network is utilised to classify emotive expressions in emotive audio datasets in English, German, and Italian. Vocal variations for spectral features are assessed by (i) a comparative analysis identifying suitable spectral features, (ii) the classification performance for mono-, multi- and cross-lingual emotive data, and (iii) an empirical evaluation of a machine learning model to assess the effects of gender and linguistic variation on classification accuracy. The results showed that spectral features provide a potential avenue for increasing emotive expression classification. Additionally, the accuracy of emotive expression classification was high within mono- and cross-lingual emotive data, but poor in multi-lingual data. Similarly, there were differences in classification accuracy between gender populations. These results demonstrate the importance of accounting for population differences to enable accurate speech emotion recognition.
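To make the feature pipeline concrete, the sketch below extracts the three spectral features named in the abstract with librosa for a single stand-in signal; the paper's CNN, datasets, and preprocessing are not reproduced.

```python
# Hedged sketch: extracting the three spectral features named in the abstract
# with librosa for one utterance. The synthetic signal and feature sizes are
# illustrative only.
import numpy as np
import librosa

sr = 16000
y = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr * 2) / sr).astype(np.float32)  # stand-in audio

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)                # (40, frames)
melspec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_melspec = librosa.power_to_db(melspec)                        # (128, frames)
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)          # (7, frames)

# A CNN classifier would consume fixed-size maps, e.g. after padding/cropping frames
for name, f in {"mfcc": mfcc, "log_mel": log_melspec, "spectral_contrast": contrast}.items():
    print(name, f.shape)
```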
【2】 Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem  Link: https://arxiv.org/abs/2112.09382
Authors: Jing Shi, Xuankai Chang, Tomoki Hayashi, Yen-Ju Lu, Shinji Watanabe, Bo Xu  Affiliations: Institute of Automation, Chinese Academy of Sciences (CASIA); Carnegie Mellon University; Nagoya University; Academia Sinica; Human Dataware Lab. Co., Ltd.  Comments: 5 pages, this https URL  Abstract: Deep learning based models have significantly improved the performance of speech separation with input mixtures like the cocktail party. Prominent methods (e.g., frequency-domain and time-domain speech separation) usually build regression models to predict the ground-truth speech from the mixture, using a masking-based design and a signal-level loss criterion (e.g., MSE or SI-SNR). This study demonstrates, for the first time, that a synthesis-based approach can also perform well on this problem, with great flexibility and strong potential. Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols, and convert the paradigm of speech separation/enhancement related tasks from regression to classification. By utilizing a synthesis model with discrete symbols as input, each target speech signal can be re-synthesized after the discrete symbol sequence is predicted. Evaluation results based on the WSJ0-2mix and VCTK-noisy corpora in various settings show that our proposed method can steadily synthesize the separated speech with high speech quality and without any interference, which is difficult to avoid in regression-based methods. In addition, with negligible loss of listening quality, speaker conversion of enhanced/separated speech can easily be realized through our method.
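As a toy illustration of the discretize-and-re-synthesize paradigm, the sketch below quantizes feature frames to codebook indices and rebuilds them by lookup; plain k-means stands in for the paper's learned discrete symbols, and no neural predictor or synthesizer is implemented.

```python
# Hedged toy illustration of discretize-and-re-synthesize: frames of a target
# signal are mapped to discrete codebook indices (here plain k-means on feature
# frames), and the output is rebuilt by codebook lookup. The paper predicts such
# discrete symbols from the *mixture* with a neural model and uses a learned
# synthesizer; none of that is reproduced here.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 64))                   # stand-in feature frames (T, D)

codebook = KMeans(n_clusters=64, n_init=4, random_state=0).fit(frames)
tokens = codebook.predict(frames)                     # discrete symbol sequence (T,)
reconstruction = codebook.cluster_centers_[tokens]    # "re-synthesis" by lookup

print(tokens[:10])
print(np.mean((frames - reconstruction) ** 2))        # quantization error
```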
【3】 Interpreting Audiograms with Multi-stage Neural Networks  Link: https://arxiv.org/abs/2112.09357
Authors: Shufan Li, Congxi Lu, Linkai Li, Jirong Duan, Xinping Fu, Haoshuai Zhou  Affiliations: Orka Labs Inc.; ENT & Audiology Center, Xinhua Hospital & Punan Hospital; Ear Institute, University College London  Comments: 12 pages, 12 figures. The code for this project is available at this https URL  Abstract: Audiograms are a particular type of line chart representing individuals' hearing level at various frequencies. They are used by audiologists to diagnose hearing loss, and further to select and tune appropriate hearing aids for customers. There have been several projects, such as Autoaudio, that aim to accelerate this process through machine learning. But all existing models at their best can only detect audiograms in images and classify them into general categories. They are unable to extract hearing level information from detected audiograms by interpreting the marks, axes, and lines. To address this issue, we propose a Multi-stage Audiogram Interpretation Network (MAIN) that directly reads hearing level data from photos of audiograms. We also established Open Audiogram, an open dataset of audiogram images with annotations of marks and axes, on which we trained and evaluated our proposed model. Experiments show that our model is feasible and reliable.
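The final interpretation stage can be pictured as a coordinate transform: once marks and axis reference ticks are detected, pixel positions map to (frequency, dB HL) pairs. The sketch below hard-codes hypothetical detections and calibration points to show that step only; it is not the paper's network.

```python
# Hedged sketch of the interpretation step: detected mark pixels are converted
# to (frequency in Hz, hearing level in dB HL) using assumed axis calibration
# points. Frequency is log-spaced on a standard audiogram; level is linear.
import numpy as np

# assumed axis calibration: two reference ticks per axis as (pixel, value)
x_refs = [(50, 125.0), (650, 8000.0)]    # 125 Hz ... 8 kHz, log scale
y_refs = [(40, -10.0), (440, 110.0)]     # -10 dB HL ... 110 dB HL, linear scale

def pixel_to_audiogram(px, py):
    (x0, f0), (x1, f1) = x_refs
    (y0, d0), (y1, d1) = y_refs
    log_f = np.log2(f0) + (px - x0) / (x1 - x0) * (np.log2(f1) - np.log2(f0))
    db = d0 + (py - y0) / (y1 - y0) * (d1 - d0)
    return 2.0 ** log_f, db

detected_marks = [(150, 140), (350, 240)]    # hypothetical detector output (px, py)
for px, py in detected_marks:
    freq, db = pixel_to_audiogram(px, py)
    print(f"{freq:.0f} Hz -> {db:.0f} dB HL")
```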
【4】 JTubeSpeech: corpus of Japanese speech collected from YouTube for speech recognition and speaker verification  Link: https://arxiv.org/abs/2112.09323
Authors: Shinnosuke Takamichi, Ludwig Kürzinger, Takaaki Saeki, Sayaka Shiota, Shinji Watanabe  Affiliations: The University of Tokyo, Japan; Technical University of Munich, Germany; Tokyo Metropolitan University, Japan; Carnegie Mellon University, USA  Comments: Submitted to ICASSP 2022  Abstract: In this paper, we construct a new Japanese speech corpus called "JTubeSpeech." Although recent end-to-end learning requires large-scale speech corpora, open-source corpora for languages other than English have not yet been established. We describe the construction of a corpus from YouTube videos and subtitles for speech recognition and speaker verification. Our method can automatically filter the videos and subtitles with almost no language-dependent processes. We consistently employ Connectionist Temporal Classification (CTC)-based techniques for automatic speech recognition (ASR) and a speaker variation-based method for automatic speaker verification (ASV). We build 1) a large-scale Japanese ASR benchmark with more than 1,300 hours of data and 2) 900 hours of data for Japanese ASV.
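A hedged sketch of CTC-based filtering in the spirit of the abstract: each (audio, subtitle) pair is scored by the length-normalized CTC negative log-likelihood of the subtitle under an ASR model's frame posteriors and rejected above a threshold. Random posteriors stand in for the ASR model, and the threshold value is an assumption.

```python
# Hedged sketch of CTC-based filtering for (audio, subtitle) pairs. Random
# per-frame label posteriors replace a real ASR model; the acceptance threshold
# is an assumed value, not the paper's.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, n_frames, sub_len = 50, 200, 30            # blank = index 0
log_probs = torch.randn(n_frames, 1, vocab_size).log_softmax(dim=-1)   # (T, N, C)
subtitle_ids = torch.randint(1, vocab_size, (1, sub_len))              # tokenized subtitle

ctc = nn.CTCLoss(blank=0, reduction="none", zero_infinity=True)
score = ctc(log_probs, subtitle_ids,
            input_lengths=torch.tensor([n_frames]),
            target_lengths=torch.tensor([sub_len]))

per_token_nll = score.item() / sub_len                 # normalize by subtitle length
keep = per_token_nll < 4.0                             # assumed acceptance threshold
print(per_token_nll, keep)
```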
【5】 MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling  Link: https://arxiv.org/abs/2112.09312
Authors: Yusong Wu, Ethan Manilow, Yi Deng, Rigel Swavely, Kyle Kastner, Tim Cooijmans, Aaron Courville, Cheng-Zhi Anna Huang, Jesse Engel  Affiliations: Mila, Quebec Artificial Intelligence Institute, Université de Montréal; Northwestern University; New York University; Google Brain (∗ equal contribution)  Abstract: Musical expression requires control of both what notes are played, and how they are performed. Conventional audio synthesizers provide detailed expressive controls, but at the cost of realism. Black-box neural audio synthesis and concatenative samplers can produce realistic audio, but have few mechanisms for control. In this work, we introduce MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control. Starting from interpretable Differentiable Digital Signal Processing (DDSP) synthesis parameters, we infer musical notes and high-level properties of their expressive performance (such as timbre, vibrato, dynamics, and articulation). This creates a 3-level hierarchy (notes, performance, synthesis) that affords individuals the option to intervene at each level, or utilize trained priors (performance given notes, synthesis given performance) for creative assistance. Through quantitative experiments and listening tests, we demonstrate that this hierarchy can reconstruct high-fidelity audio, accurately predict performance attributes for a note sequence, independently manipulate the attributes of a given performance, and, as a complete system, generate realistic audio from a novel note sequence. By utilizing an interpretable hierarchy with multiple levels of granularity, MIDI-DDSP opens the door to assistive tools that empower individuals across a diverse range of musical experience.
【6】 EEG-Transformer: Self-attention from Transformer Architecture for Decoding EEG of Imagined Speech  Link: https://arxiv.org/abs/2112.09239
Authors: Young-Eun Lee, Seo-Hyun Lee  Affiliations: Dept. of Brain and Cognitive Engineering, Korea University, Seoul, Republic of Korea  Comments: Submitted to the IEEE BCI Winter Conference  Abstract: Transformers are groundbreaking architectures that have changed the flow of deep learning, and many high-performance models are being developed on top of transformer architectures. Transformers are implemented with attention only, using an encoder-decoder structure following seq2seq without RNNs, yet achieve better performance than RNNs. Herein, we investigate a decoding technique for electroencephalography (EEG), composed of the self-attention module from the transformer architecture, during imagined speech and overt speech. We performed classification for nine subjects using a convolutional neural network based on EEGNet that captures temporal-spectral-spatial features from EEG of imagined speech and overt speech. Furthermore, we applied the self-attention module to EEG decoding to improve performance and lower the number of parameters. Our results demonstrate the possibility of decoding brain activity of imagined speech and overt speech using attention modules. Also, only single-channel EEG or ear-EEG may be needed to decode imagined speech for practical BCIs.
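A minimal sketch of the kind of self-attention block the abstract describes, applied to a sequence of EEG feature frames (e.g., from an EEGNet-style front end); dimensions, head count, and the number of classes are illustrative assumptions rather than the paper's configuration.

```python
# Hedged sketch: one self-attention block over EEG feature frames followed by
# temporal pooling and a linear classifier. Shapes and hyperparameters are
# illustrative assumptions.
import torch
import torch.nn as nn

class EEGSelfAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_classes=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                  # x: (batch, time, d_model)
        a, _ = self.attn(x, x, x)          # self-attention over time frames
        h = self.norm(x + a)               # residual connection + layer norm
        return self.head(h.mean(dim=1))    # temporal pooling -> class logits

feats = torch.randn(8, 125, 64)            # batch of 8, 125 frames, 64-dim features
print(EEGSelfAttention()(feats).shape)     # torch.Size([8, 4])
```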
【7】 Audio Retrieval with Natural Language Queries: A Benchmark Study  Link: https://arxiv.org/abs/2112.09418
Authors: A. Sophia Koepke, Andreea-Maria Oncescu, João F. Henriques, Zeynep Akata, Samuel Albanie  Affiliations: Explainable Machine Learning group, University of Tübingen; Max Planck Institute for Intelligent Systems, Tübingen; Max Planck Institute for Informatics  Comments: Submitted to Transactions on Multimedia. arXiv admin note: substantial text overlap with arXiv:2105.02192  Abstract: The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like to hear. To study the tasks of text-audio and audio-text retrieval, which have received limited attention in the existing literature, we introduce three challenging new benchmarks. We first construct text-audio and audio-text retrieval benchmarks from the AudioCaps and Clotho audio captioning datasets. Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho. We employ these three benchmarks to establish baselines for cross-modal text-audio and audio-text retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into audio retrieval with free-form text queries. Code, audio features for all datasets used, and the datasetName dataset will be made publicly available.
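The retrieval evaluation itself is simple to state: rank candidates by embedding similarity and report Recall@k in both directions. The sketch below does this with random placeholder embeddings; no model from the benchmark is implied.

```python
# Hedged sketch of the retrieval metric: given L2-normalized text and audio
# embeddings for N paired items (from any embedding model), rank candidates by
# cosine similarity and report Recall@k both ways. Embeddings here are random
# placeholders, so the recalls will be near chance.
import numpy as np

def recall_at_k(sim, k):
    # sim[i, j]: similarity of query i to candidate j; ground truth is the diagonal
    ranks = (-sim).argsort(axis=1)
    hits = (ranks[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return hits.mean()

rng = np.random.default_rng(0)
N, d = 1000, 512
text = rng.normal(size=(N, d));  text /= np.linalg.norm(text, axis=1, keepdims=True)
audio = rng.normal(size=(N, d)); audio /= np.linalg.norm(audio, axis=1, keepdims=True)

sim = text @ audio.T                       # text-to-audio similarities
for k in (1, 5, 10):
    print(f"T->A R@{k}: {recall_at_k(sim, k):.3f}   A->T R@{k}: {recall_at_k(sim.T, k):.3f}")
```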
3. eess.AS (Audio and Speech Processing):
【1】 Dialog+ in Broadcasting: First Field Tests Using Deep-Learning-Based Dialogue Enhancement  Link: https://arxiv.org/abs/2112.09494
Authors: Matteo Torcoli, Christian Simon, Jouni Paulus, Davide Straninger, Alfred Riedel, Volker Koch, Stefan Wits, Daniela Rieger, Harald Fuchs, Christian Uhle, Stefan Meltzer, Adrian Murtaza  Affiliations: Fraunhofer Institute for Integrated Circuits IIS  Comments: Presented at IBC 2021 (International Broadcasting Convention)  Abstract: Difficulties in following speech due to loud background sounds are common in broadcasting. Object-based audio, e.g., MPEG-H Audio, solves this problem by providing a user-adjustable speech level. While object-based audio is gaining momentum, transitioning to it requires time and effort. Also, lots of content exists, produced and archived outside the object-based workflows. To address this, Fraunhofer IIS has developed a deep-learning solution called Dialog+, capable of enabling speech level personalization also for content with only the final audio tracks available. This paper reports on public field tests evaluating Dialog+, conducted together with Westdeutscher Rundfunk (WDR) and Bayerischer Rundfunk (BR), starting from September 2020. To our knowledge, these are the first large-scale tests of this kind. As part of one of these, a survey with more than 2,000 participants showed that 90% of the people above 60 years old have problems in understanding speech in TV "often" or "very often". Overall, 83% of the participants liked the possibility to switch to Dialog+, including those who do not normally struggle with speech intelligibility. Dialog+ introduces a clear benefit for the audience, filling the gap between object-based broadcasting and traditionally produced material.
【2】 Continual Learning for Monolingual End-to-End Automatic Speech Recognition  Link: https://arxiv.org/abs/2112.09427
Authors: Steven Vander Eeckt, Hugo Van hamme  Affiliations: KU Leuven, Department of Electrical Engineering ESAT-PSI, Leuven, Belgium  Comments: Submitted to ICASSP 2022. 5 pages, 1 figure  Abstract: Adapting Automatic Speech Recognition (ASR) models to new domains leads to a deterioration of performance on the original domain(s), a phenomenon called Catastrophic Forgetting (CF). Even monolingual ASR models cannot be extended to new accents, dialects, topics, etc. without suffering from CF, making them unable to be continually enhanced without storing all past data. Fortunately, Continual Learning (CL) methods, which aim to enable continual adaptation while overcoming CF, can be used. In this paper, we implement an extensive number of CL methods for End-to-End ASR and test and compare their ability to extend a monolingual Hybrid CTC-Transformer model across four new tasks. We find that the best performing CL method closes the gap between the fine-tuned model (lower bound) and the model trained jointly on all tasks (upper bound) by more than 40%, while requiring access to only 0.6% of the original data.
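As one representative of the CL methods compared (not necessarily the paper's best performer), the sketch below shows a generic Elastic Weight Consolidation penalty that anchors parameters deemed important on the old domain while the model is fine-tuned on a new accent, dialect, or topic.

```python
# Hedged sketch of generic Elastic Weight Consolidation (EWC), one standard CL
# regularizer: after training on the old domain, parameter-wise Fisher estimates
# anchor important weights during fine-tuning on the new domain. Not the paper's
# exact implementation.
import torch

def ewc_penalty(model, anchor_params, fisher, lam=1.0):
    """Quadratic penalty (lam/2) * sum_i F_i * (theta_i - theta_i*)^2."""
    loss = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (p - anchor_params[name]) ** 2).sum()
    return 0.5 * lam * loss

model = torch.nn.Linear(10, 4)                                 # stand-in for the ASR model
anchor = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}   # placeholder Fisher
print(ewc_penalty(model, anchor, fisher, lam=0.1))             # zero before any update

# Usage inside the new-domain training loop (asr_loss from the CTC/attention objective):
#   loss = asr_loss + ewc_penalty(model, anchor, fisher, lam=0.1)
#   loss.backward(); optimizer.step()
```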
【3】 Audio Retrieval with Natural Language Queries: A Benchmark Study  Link: https://arxiv.org/abs/2112.09418  (cross-listed; see cs.SD 【7】 above)
【4】 Linguistic and Gender Variation in Speech Emotion Recognition using Spectral Features  Link: https://arxiv.org/abs/2112.09596  (cross-listed; see cs.SD 【1】 above)
【5】 Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem  Link: https://arxiv.org/abs/2112.09382  (cross-listed; see cs.SD 【2】 above)
【6】 Interpreting Audiograms with Multi-stage Neural Networks  Link: https://arxiv.org/abs/2112.09357  (cross-listed; see cs.SD 【3】 above)
【7】 JTubeSpeech: corpus of Japanese speech collected from YouTube for speech recognition and speaker verification  Link: https://arxiv.org/abs/2112.09323  (cross-listed; see cs.SD 【4】 above)
【8】 MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling  Link: https://arxiv.org/abs/2112.09312  (cross-listed; see cs.SD 【5】 above)
【9】 EEG-Transformer: Self-attention from Transformer Architecture for Decoding EEG of Imagined Speech  Link: https://arxiv.org/abs/2112.09239  (cross-listed; see cs.SD 【6】 above)