Finance / Speech / Audio Processing Academic Digest [8.18]

2021-08-24 16:26:08


q-fin (Quantitative Finance): 2 papers

cs.SD (Sound): 7 papers

eess.AS (Audio and Speech Processing): 7 papers

1. q-fin (Quantitative Finance):

【1】 Analysis of Data Mining Process for Improvement of Production Quality in Industrial Sector Link: https://arxiv.org/abs/2108.07615

Authors: Hamza Saad, Nagendra Nagarur, Abdulrahman Shamsan
Affiliation: Department of Systems Science and Industrial Engineering, Binghamton University, New York, USA
Note: 11 pages, 2 figures, 11 tables, peer-reviewed paper
Abstract: Background and Objective: Different industries run high-precision, complex processes whose data must be analyzed to discover defects before they grow. Big data may contain many variables with missing values that play a vital role in understanding what affects quality, so process specialists may struggle to identify which variables have a direct effect on the process. The aim of this study was to build an integrated data analysis using data mining and quality tools to improve the quality of production and processes. Materials and Methods: Data were collected in several steps to reduce missing values. Specialists in the production process recommended selecting the most important variables from the big data, and predictor screening was then used to confirm 16 of 71 variables. Seven important variables formed the output variable, called the textile quality score. After testing ten algorithms, boosted trees and random forests were selected to extract knowledge. In a voting process, three variables were confirmed for use as input factors in the design of experiments. The design response was estimated by data mining, and the results were confirmed by the quality specialists. A central composite (response surface) design was run 17 times to extract the main effects and interactions on the textile quality score. Results: The study found that machine productivity has a negative effect on quality, which was validated by management. After applying the changes, production efficiency improved by 21%. Conclusion: The results confirm a large improvement in quality processes in the industrial sector: production efficiency improved by 21%, the weaving process by 23%, and the overall process by 17.06%.
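The predictor-screening step described above (narrowing 71 candidate variables down to the handful that drive the quality score) can be illustrated with a toy sketch. Everything below is a hypothetical stand-in: synthetic data, a simple correlation-based ranking in place of the paper's screening tool, and invented variable indices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
# Hypothetical "textile quality score": driven by only 3 of the 20 variables.
quality = 2.0 * X[:, 3] - 1.5 * X[:, 7] + 1.0 * X[:, 12] + rng.normal(scale=0.5, size=n)

# Screen predictors by absolute correlation with the response,
# keeping the top 3 as candidate input factors for a designed experiment.
scores = np.abs([np.corrcoef(X[:, j], quality)[0, 1] for j in range(p)])
top3 = set(np.argsort(scores)[-3:])
print(sorted(top3))  # recovers the three informative indices: [3, 7, 12]
```

With this margin between signal and noise the screening reliably recovers the three injected variables; a real pipeline would use a model-based importance measure instead of raw correlation.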

【2】 Elephants or Goldfish? An Empirical Analysis of Carrier Reciprocity in Dynamic Freight Markets Link: https://arxiv.org/abs/2108.07348

Authors: Angela Acocella, Chris Caplice, Yossi Sheffi
Affiliation: MIT Center for Transportation & Logistics, Amherst St., Cambridge, MA
Abstract: Dynamic macroeconomic conditions and non-binding truckload freight contracts enable both shippers and carriers to behave opportunistically. We present an empirical analysis of carrier reciprocity in the US truckload transportation sector to determine whether consistent performance and fair pricing by shippers when markets are in their favor result in maintained primary carrier tender acceptance when markets turn. The results suggest carriers have short memories; they do not remember shippers' previous-period pricing, tendering behavior, or performance when making freight acceptance decisions. Instead, carriers appear to be myopic, responding to shippers' current-period behaviors, ostensibly without regard to shippers' previous behaviors.

2. cs.SD (Sound):

【1】 Adapting GPT, GPT-2 and BERT Language Models for Speech Recognition Link: https://arxiv.org/abs/2108.07789

Authors: Xianrui Zheng, Chao Zhang, Philip C. Woodland
Affiliation: Cambridge University Engineering Dept., Trumpington St., Cambridge, U.K.
Abstract: Language models (LMs) pre-trained on massive amounts of text, in particular bidirectional encoder representations from Transformers (BERT), generative pre-training (GPT), and GPT-2, have become a key technology for many natural language processing tasks. In this paper, we present results using fine-tuned GPT, GPT-2, and their combination for automatic speech recognition (ASR). Unlike the unidirectional LMs GPT and GPT-2, BERT is bidirectional, so the direct product of its output probabilities is no longer a valid language prior probability. A conversion method is proposed to compute the correct language prior probability from bidirectional LM outputs in a mathematically exact way. Experimental results on the widely used AMI and Switchboard ASR tasks showed that the combination of the fine-tuned GPT and GPT-2 outperformed the combination of three neural LMs with different architectures trained from scratch on the in-domain text by up to a 12% relative word error rate reduction (WERR). Furthermore, the proposed conversion for language prior probabilities enables BERT to achieve an extra 3% relative WERR, and the combination of BERT, GPT and GPT-2 yields further improvements.
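Combining several LMs for ASR rescoring, as in the abstract above, is commonly done by interpolating per-token probabilities across models and summing the log-probabilities over the hypothesis. A minimal sketch with made-up numbers; the weights and token probabilities are illustrative, not from the paper.

```python
import math

def interpolate_log_probs(token_log_probs, weights):
    """Linearly interpolate per-token probabilities from several LMs,
    then return the total log-probability of the hypothesis.
    token_log_probs: one list per LM of log p(token | history)."""
    assert len(token_log_probs) == len(weights)
    n_tokens = len(token_log_probs[0])
    total = 0.0
    for t in range(n_tokens):
        # Interpolate in the probability domain: p = sum_k w_k * p_k.
        p = sum(w * math.exp(lm[t]) for w, lm in zip(weights, token_log_probs))
        total += math.log(p)
    return total

# Toy per-token scores for a 3-token hypothesis from two hypothetical LMs
# (standing in for fine-tuned GPT and GPT-2).
gpt = [math.log(0.2), math.log(0.1), math.log(0.4)]
gpt2 = [math.log(0.3), math.log(0.2), math.log(0.2)]
score = interpolate_log_probs([gpt, gpt2], weights=[0.5, 0.5])
```

In practice the interpolation weights are tuned on held-out data, and the combined LM score is further interpolated with the acoustic model score before reranking the n-best list.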

【2】 Combining speakers of multiple languages to improve quality of neural voices Link: https://arxiv.org/abs/2108.07737

Authors: Javier Latorre, Charlotte Bailleul, Tuuli Morrill, Alistair Conkie, Yannis Stylianou
Affiliation: Apple
Note: 6 pages, 3 figures. Accepted to the 11th Speech Synthesis Workshop (SSW11)
Abstract: In this work, we explore multiple architectures and training procedures for developing a multi-speaker and multi-lingual neural TTS system, with the goals of a) improving quality when the available data in the target language is limited and b) enabling cross-lingual synthesis. We report results from a large experiment using 30 speakers in 8 different languages across 15 different locales. The system is trained on the same amount of data per speaker. Compared to a single-speaker model, when the proposed system is fine-tuned to a speaker it produces significantly better quality in most cases, while using less than 40% of the speaker data used to build the single-speaker model. In cross-lingual synthesis, the generated quality is on average within 80% of native single-speaker models in terms of Mean Opinion Score.

【3】 Look Who's Talking: Active Speaker Detection in the Wild Link: https://arxiv.org/abs/2108.07640

Authors: You Jin Kim, Hee-Soo Heo, Soyeon Choe, Soo-Whan Chung, Yoohwan Kwon, Bong-Jin Lee, Youngki Kwon, Joon Son Chung
Affiliation: Naver Corporation, South Korea
Note: To appear in Interspeech 2021. Data will be available from this https URL
Abstract: In this work, we present a novel audio-visual dataset for active speaker detection in the wild. A speaker is considered active when his or her face is visible and the voice is audible simultaneously. Although active speaker detection is a crucial pre-processing step for many audio-visual tasks, there is no existing dataset of natural human speech to evaluate its performance. We therefore curate the Active Speakers in the Wild (ASW) dataset, which contains videos and co-occurring speech segments with dense speech activity labels. Videos and timestamps of audible segments are parsed and adopted from VoxConverse, an existing speaker diarisation dataset consisting of videos in the wild. Face tracks are extracted from the videos, and active segments are annotated semi-automatically based on the VoxConverse timestamps. Two reference systems, one self-supervised and one fully supervised, are evaluated on the dataset to provide baseline performances for ASW. A cross-domain evaluation is conducted to show the negative effect of dubbed videos in the training data.

【4】 Neonatal Bowel Sound Detection Using Convolutional Neural Network and Laplace Hidden Semi-Markov Model Link: https://arxiv.org/abs/2108.07467

Authors: Chiranjibi Sitaula, Jinyuan He, Archana Priyadarshi, Mark Tracy, Omid Kavehei, Murray Hinder, Anusha Withana, Alistair McEwan, Faezeh Marzbanrad
Affiliation: Department of Paediatrics and Child Health, The University of Sydney and Westmead Hospital; School of Biomedical Engineering
Note: Under review at the IEEE/ACM Transactions on Audio, Speech and Language Processing
Abstract: Abdominal auscultation is a convenient, safe and inexpensive method to assess bowel conditions, which is essential in neonatal care. It helps early detection of neonatal bowel dysfunctions and allows timely intervention. This paper presents a neonatal bowel sound detection method to assist auscultation. Specifically, a Convolutional Neural Network (CNN) is proposed to classify peristalsis and non-peristalsis sounds. The classification is then refined using a Laplace Hidden Semi-Markov Model (HSMM). The proposed method is validated on abdominal sounds from 49 newborn infants admitted to our tertiary Neonatal Intensive Care Unit (NICU). The results show that the method can effectively detect bowel sounds, with an accuracy of 89.81% and an area under the curve (AUC) score of 83.96%, outperforming 13 baseline methods. Furthermore, the proposed Laplace HSMM refinement strategy is shown to be capable of enhancing other bowel sound detection models. The outcomes of this work have the potential to facilitate future telehealth applications for neonatal care. The source code of our work can be found at: https://bitbucket.org/chirudeakin/neonatal-bowel-sound-classification/
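The refinement idea above (smoothing noisy frame-wise CNN posteriors with a temporal model) can be sketched with a plain two-state HMM decoded by Viterbi. This is a deliberate simplification standing in for the paper's Laplace HSMM, and all probabilities below are invented.

```python
import math

def viterbi_smooth(frame_probs, p_stay=0.9):
    """Smooth per-frame P(bowel sound) with a 2-state HMM via Viterbi.
    The self-transition prior (p_stay) discourages rapid label flicker,
    playing the role of the paper's duration-aware HSMM refinement."""
    states = (0, 1)
    log_trans = {(a, b): math.log(p_stay if a == b else 1 - p_stay)
                 for a in states for b in states}

    def emit(s, p):
        # Emission log-likelihood from the classifier's posterior.
        p = min(max(p, 1e-9), 1 - 1e-9)
        return math.log(p if s == 1 else 1 - p)

    V = [{s: emit(s, frame_probs[0]) for s in states}]
    back = []
    for p in frame_probs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda a: V[-1][a] + log_trans[(a, s)])
            row[s] = V[-1][prev] + log_trans[(prev, s)] + emit(s, p)
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # Backtrace the best state sequence.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# A single spurious dip (0.4) inside a confident positive run is smoothed
# over, while the genuinely negative tail frames stay negative.
probs = [0.9, 0.85, 0.4, 0.9, 0.95, 0.05, 0.05]
print(viterbi_smooth(probs))  # [1, 1, 1, 1, 1, 0, 0]
```

An HSMM generalizes this by modeling explicit state-duration distributions (Laplace, in the paper) rather than the geometric durations implied by a fixed self-transition probability.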

【5】 DeepEigen: Learning-based Modal Sound Synthesis with Acoustic Transfer Maps Link: https://arxiv.org/abs/2108.07425

Authors: Xutong Jin, Sheng Li, Dinesh Manocha, Guoping Wang
Affiliation: Peking University; University of Maryland
Abstract: We present a novel learning-based approach to compute the eigenmodes and acoustic transfer data for the sound synthesis of arbitrary solid objects. Our approach combines two network-based solutions to formulate a complete learning-based 3D modal sound model. This includes a 3D sparse convolution network as the eigendecomposition solver and an encoder-decoder network for the prediction of Far-Field Acoustic Transfer maps (FFAT Maps). We use our approach to compute the vibration modes (eigenmodes) and the FFAT map for each mode (acoustic data) for arbitrarily shaped objects at interactive rates, without any precomputed dataset for new objects. Our experimental results demonstrate the effectiveness and benefits of our approach. We compare its accuracy and efficiency with physically based sound synthesis methods.
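The quantity the network above learns to approximate is classical modal analysis: an eigendecomposition of the discretized object's stiffness matrix, whose eigenvalues give the squared modal frequencies and whose eigenvectors give the mode shapes. A toy example on a 1D mass-spring chain (the geometry and constants are illustrative, not from the paper):

```python
import numpy as np

# Stiffness matrix for a chain of 4 unit masses joined by unit springs
# with both ends fixed: tridiagonal with 2k on the diagonal, -k off it.
n, k = 4, 1.0
K = np.zeros((n, n))
for i in range(n):
    K[i, i] = 2.0 * k
    if i > 0:
        K[i, i - 1] = K[i - 1, i] = -k

# K is symmetric, so eigh returns real eigenvalues (squared angular
# frequencies, since the masses are 1) and orthogonal mode shapes.
eigvals, modes = np.linalg.eigh(K)
freqs = np.sqrt(eigvals)  # modal angular frequencies, ascending
```

For this chain the eigenvalues have the closed form 2k(1 - cos(m*pi/(n+1))), m = 1..n, which is a handy correctness check; DeepEigen's contribution is replacing this (expensive, per-object) solve on large 3D meshes with a learned network.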

【6】 Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation Link: https://arxiv.org/abs/2108.07376

Authors: Zhong-Qiu Wang, Gordon Wichern, Jonathan Le Roux
Note: 16 pages, 4 figures
Abstract: A promising approach for speech dereverberation is based on supervised learning, where a deep neural network (DNN) is trained to predict the direct sound from noisy-reverberant speech. This data-driven approach leverages prior knowledge of clean speech patterns but does not explicitly exploit the linear-filter structure in reverberation, i.e., that reverberation results from a linear convolution between a room impulse response (RIR) and a dry source signal. In this work, we propose to exploit this linear-filter structure within a deep-learning-based monaural speech dereverberation framework. The key idea is to first estimate the direct-path signal of the target speaker using a DNN and then identify signals that are decayed and delayed copies of the estimated direct-path signal, as these can be reliably considered reverberation. They can either be directly removed for dereverberation, or used as extra features for another DNN to perform better dereverberation. To identify the copies, we estimate the underlying filter (or RIR) by efficiently solving a linear regression problem per frequency in the time-frequency domain. We then extend the proposed algorithm to speaker separation in reverberant and noisy-reverberant conditions. State-of-the-art speech dereverberation and speaker separation results are obtained on the REVERB, SMS-WSJ, and WHAMR! datasets.
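The filter-identification step can be sketched in the time domain: regress the reverberant signal on delayed copies of the (estimated) direct-path signal and recover the filter taps by least squares. Note the paper solves this per frequency in the time-frequency domain; this simplified sketch uses synthetic signals and a hypothetical 4-tap filter.

```python
import numpy as np

rng = np.random.default_rng(1)
direct = rng.normal(size=200)          # stand-in for the DNN's direct-path estimate
rir = np.array([1.0, 0.0, 0.6, 0.3])   # hypothetical short room filter
reverberant = np.convolve(direct, rir)[:200]

# Design matrix of delayed copies of the direct-path signal,
# then solve for the filter taps by linear least squares.
taps = 4
D = np.column_stack([np.concatenate([np.zeros(d), direct[:200 - d]])
                     for d in range(taps)])
est, *_ = np.linalg.lstsq(D, reverberant, rcond=None)

# Residual after subtracting the predicted (decayed, delayed) copies:
# what remains after convolutive-prediction dereverberation.
residual = reverberant - D @ est
```

With noise-free synthetic data the regression recovers the filter exactly; with a real noisy direct-path estimate, the residual would also be fed to a second DNN, as the abstract describes.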

【7】 Precision and accuracy of acoustic gunshot location in an urban environment Link: https://arxiv.org/abs/2108.07377

Authors: Robert B. Calhoun, Clark Dunson, Murphey L. Johnson, Scott R. Lamkin, William R. Lewis, Robert L. Showen, Mark A. Sompel, Lester P. Wollman
Affiliation: ShotSpotter, Inc., Newark, CA, USA
Abstract: The muzzle blast caused by the discharge of a firearm generates a loud, impulsive sound that propagates away from the shooter in all directions. The location of the source can be computed from time-of-arrival measurements of the muzzle blast on multiple acoustic sensors at known locations, a technique known as multilateration. The multilateration problem is considerably simplified by assuming straight-line propagation in a homogeneous medium, a model for which there are multiple published solutions. Live-fire tests of the ShotSpotter gunshot location system in Pittsburgh, PA were analyzed offline under several algorithms and geometric constraints to evaluate the accuracy of acoustic multilateration in a forensic context. Best results were obtained using the algorithm due to Mathias, Leonari and Galati under a two-dimensional geometric constraint. Multilateration on random subsets of the participating sensor array shows that 96% of shots can be located to an accuracy of 15 m or better when six or more sensors participate in the solution.
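The straight-line multilateration model the abstract describes can be solved with a simple linearization: subtracting the squared-range equation at a reference sensor from the others cancels the quadratic terms, leaving a linear least-squares problem in the source position and emission time. A 2D sketch with hypothetical sensor positions; this is a generic textbook formulation, not ShotSpotter's algorithm.

```python
import numpy as np

C = 343.0  # speed of sound in air, m/s

def multilaterate(sensors, times):
    """Locate a 2D source from arrival times at known sensor positions,
    assuming straight-line propagation in a homogeneous medium.
    For each sensor i: (x - xi)^2 + (y - yi)^2 = C^2 (ti - s)^2, where s
    is the unknown emission time. Subtracting the equation at sensor 0
    cancels x^2 + y^2 and s^2, giving a linear system in (x, y, s)."""
    x0, y0 = sensors[0]
    t0 = times[0]
    A, b = [], []
    for (xi, yi), ti in zip(sensors[1:], times[1:]):
        A.append([2 * (xi - x0), 2 * (yi - y0), 2 * C**2 * (t0 - ti)])
        b.append(xi**2 - x0**2 + yi**2 - y0**2 - C**2 * (ti**2 - t0**2))
    sol, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return sol  # (x, y, emission time)

# Hypothetical 5-sensor array and source; exact noise-free arrival times.
sensors = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0), (100.0, 100.0), (50.0, -60.0)]
true_src = np.array([30.0, 40.0])
s_true = 0.1
times = [s_true + np.hypot(*(true_src - np.array(p))) / C for p in sensors]
x, y, s = multilaterate(sensors, times)
```

Three unknowns need at least four sensors; with more, the overdetermined least-squares fit averages out timing noise, which is why the paper's accuracy improves once six or more sensors participate.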

3. eess.AS (Audio and Speech Processing):

【1】 Precision and accuracy of acoustic gunshot location in an urban environment Link: https://arxiv.org/abs/2108.07377 (cross-listed; see cs.SD 【7】 above)

【2】 Adapting GPT, GPT-2 and BERT Language Models for Speech Recognition Link: https://arxiv.org/abs/2108.07789 (cross-listed; see cs.SD 【1】)

【3】 Combining speakers of multiple languages to improve quality of neural voices Link: https://arxiv.org/abs/2108.07737 (cross-listed; see cs.SD 【2】)

【4】 Look Who's Talking: Active Speaker Detection in the Wild Link: https://arxiv.org/abs/2108.07640 (cross-listed; see cs.SD 【3】)

【5】 Neonatal Bowel Sound Detection Using Convolutional Neural Network and Laplace Hidden Semi-Markov Model Link: https://arxiv.org/abs/2108.07467 (cross-listed; see cs.SD 【4】)

【6】 DeepEigen: Learning-based Modal Sound Synthesis with Acoustic Transfer Maps Link: https://arxiv.org/abs/2108.07425 (cross-listed; see cs.SD 【5】)

【7】 Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation Link: https://arxiv.org/abs/2108.07376 (cross-listed; see cs.SD 【6】)
