Natural Language Processing Academic Digest [7.26]

2021-07-27 11:18:25

Visit www.arxivdaily.com for daily digests with abstracts, covering CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, plus search, bookmarking, and posting features!

cs.CL: 13 papers today

Transformer (1 paper)

【1】 Did the Cat Drink the Coffee? Challenging Transformers with Generalized Event Knowledge

Authors: Paolo Pedinotti, Giulia Rambelli, Emmanuele Chersoni, Enrico Santus, Alessandro Lenci, Philippe Blache
Affiliations: University of Pisa; Aix-Marseille University; The Hong Kong Polytechnic University; Bayer Pharmaceuticals
Link: https://arxiv.org/abs/2107.10922
Abstract: Prior research has explored the ability of computational models to predict a word's semantic fit with a given predicate. While much work has been devoted to modeling the typicality relation between verbs and arguments in isolation, in this paper we take a broader perspective by assessing whether and to what extent computational approaches have access to information about the typicality of entire events and situations described in language (Generalized Event Knowledge). Given the recent success of Transformer Language Models (TLMs), we decided to test them on a benchmark for the dynamic estimation of thematic fit. The evaluation of these models was performed in comparison with SDM, a framework specifically designed to integrate events in sentence meaning representations, and we conducted a detailed error analysis to investigate which factors affect their behavior. Our results show that TLMs can reach performances that are comparable to those achieved by SDM. However, additional analyses consistently suggest that TLMs do not capture important aspects of event knowledge, and their predictions often depend on surface linguistic features, such as frequent words, collocations, and syntactic patterns, thereby showing sub-optimal generalization abilities.
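As a rough illustration of how a Transformer LM can be probed for event typicality, the sketch below compares masked-token log-probabilities for competing arguments. This is not the paper's benchmark; the model choice, template, and candidate words are illustrative.

```python
# Probe a masked LM for thematic fit: how plausible is each candidate in the
# patient slot of an event? Illustrative only -- the paper uses a dedicated
# benchmark for the dynamic estimation of thematic fit.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def fit_score(template: str, candidate: str) -> float:
    """Log-probability of `candidate` in the [MASK] slot of `template`."""
    ids = tok(template.format(tok.mask_token), return_tensors="pt")
    mask_pos = (ids.input_ids == tok.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = mlm(**ids).logits[0, mask_pos]
    return logits.log_softmax(-1)[tok.convert_tokens_to_ids(candidate)].item()

for noun in ["coffee", "milk", "stone"]:
    print(noun, fit_score("The cat drank the {}.", noun))
```

A model relying on surface co-occurrence rather than event knowledge may rank collocationally frequent fillers highly even when they are atypical for the described event.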

Machine Translation (1 paper)

【1】 Modeling Bilingual Conversational Characteristics for Neural Chat Translation

Authors: Yunlong Liang, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
Affiliations: Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, China; Pattern Recognition Center, WeChat AI, Tencent Inc, China
Notes: Accepted as a long paper at ACL 2021. Code and data are available at this https URL
Link: https://arxiv.org/abs/2107.11164
Abstract: Neural chat translation aims to translate bilingual conversational text, which has broad applications in international exchange and cooperation. Despite the impressive performance of sentence-level and context-aware Neural Machine Translation (NMT), challenges remain in translating bilingual conversational text due to its inherent characteristics, such as role preference, dialogue coherence, and translation consistency. In this paper, we aim to promote the translation quality of conversational text by modeling the above properties. Specifically, we design three latent variational modules to learn the distributions of bilingual conversational characteristics. By sampling from these learned distributions, latent variables tailored for role preference, dialogue coherence, and translation consistency are incorporated into the NMT model for better translation. We evaluate our approach on the benchmark dataset BConTrasT (English-German) and a self-collected bilingual dialogue corpus named BMELD (English-Chinese). Extensive experiments show that our approach notably boosts performance over strong baselines by a large margin and significantly surpasses some state-of-the-art context-aware NMT models in terms of BLEU and TER. Additionally, we make the BMELD dataset publicly available to the research community.
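For readers unfamiliar with latent variational modules, the sketch below shows the usual pattern: encode the conversational context into a Gaussian posterior, sample a latent vector with the reparameterization trick, and add a KL term to the training loss. Dimensions and wiring are assumptions, not the paper's exact architecture.

```python
# One latent variational module: encode context -> Gaussian posterior,
# sample z with the reparameterization trick, feed z to the NMT decoder.
# Dimensions and wiring are illustrative assumptions.
import torch
import torch.nn as nn

class LatentModule(nn.Module):
    def __init__(self, ctx_dim=512, z_dim=64):
        super().__init__()
        self.mu = nn.Linear(ctx_dim, z_dim)
        self.logvar = nn.Linear(ctx_dim, z_dim)

    def forward(self, ctx):                       # ctx: (batch, ctx_dim)
        mu, logvar = self.mu(ctx), self.logvar(ctx)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        # KL term against a standard normal prior, added to the training loss.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return z, kl

role_pref = LatentModule()                        # one module per characteristic
z, kl = role_pref(torch.randn(4, 512))            # z conditions the decoder
print(z.shape, kl.item())
```

At training time the KL terms from all three modules (role preference, dialogue coherence, translation consistency) would be added to the translation loss.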

Graph | Knowledge Graph | Knowledge (2 papers)

【1】 Powering Effective Climate Communication with a Climate Knowledge Base

Authors: Kameron B. Rodrigues, Shweta Khushu, Mukut Mukherjee, Andrew Banister, Anthony Hevia, Sampath Duddu, Nikita Bhutani
Affiliations: Stanford University, USA; University of Illinois
Link: https://arxiv.org/abs/2107.11351
Abstract: While many accept climate change and its growing impacts, few converse about it well, limiting the adoption speed of the societal changes necessary to address it. To make effective climate communication easier, we aim to build a system that presents to any individual the climate information predicted to best motivate and inspire them to take action, given their unique set of personal values. To alleviate the cold-start problem, the system relies on a knowledge base (ClimateKB) of causes and effects of climate change, and their associations to personal values. Since no such comprehensive ClimateKB exists, we revisit knowledge base construction techniques and build a ClimateKB from free text. We plan to open-source the ClimateKB and associated code to encourage future research and applications.
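The paper does not publish its extraction pipeline, but pattern-based knowledge base construction from free text often starts from lexical cause/effect templates; a toy sketch follows, with invented patterns and sentences.

```python
# Toy cause-effect extractor over free text using lexical patterns.
# The actual ClimateKB pipeline is not specified; this is illustrative only.
import re

PATTERNS = [
    re.compile(r"(?P<cause>.+?) (?:causes|leads to|results in) (?P<effect>.+)", re.I),
    re.compile(r"(?P<effect>.+?) (?:is caused by|results from) (?P<cause>.+)", re.I),
]

def extract(sentence: str):
    """Return a (cause, effect) pair if any pattern matches, else None."""
    for pat in PATTERNS:
        m = pat.search(sentence.rstrip("."))
        if m:
            return m["cause"].strip(), m["effect"].strip()
    return None

kb = []
for s in ["Burning fossil fuels leads to higher CO2 concentrations.",
          "Coastal flooding is caused by rising sea levels."]:
    triple = extract(s)
    if triple:
        kb.append(triple)
print(kb)  # [(cause, effect), ...]
```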

【2】 Graph-Based Learning for Stock Movement Prediction with Textual and Relational Data

Authors: Qinkai Chen, Christian-Yann Robert
Affiliations: Ecole Polytechnique, Palaiseau, France; Exoduspoint Capital Management France, Paris, France; ENSAE Paris, Palaiseau, France
Notes: 10 pages, 3 figures, 5 tables
Link: https://arxiv.org/abs/2107.10941
Abstract: Predicting stock prices from textual information is a challenging task due to the uncertainty of the market and the difficulty of understanding natural language from a machine's perspective. Previous research focuses mostly on sentiment extraction based on single news items. However, stocks on the financial market can be highly correlated; one news item about one stock can quickly impact the prices of other stocks. To take this effect into account, we propose a new stock movement prediction framework: the Multi-Graph Recurrent Network for Stock Forecasting (MGRN). This architecture makes it possible to combine textual sentiment from financial news with multiple relational information extracted from other financial data. Through an accuracy test and a trading simulation on the stocks in the STOXX Europe 600 index, we demonstrate better performance from our model than other benchmarks.
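A minimal sketch of the multi-graph idea (not the MGRN implementation): propagate per-stock features such as news sentiment over several relation graphs and fuse the relation-specific views. The propagation rule, fusion by averaging, and dimensions are assumptions.

```python
# Combine one feature view per relation graph via normalized adjacency
# propagation; a recurrent or attention-based combiner could replace the mean.
import torch
import torch.nn as nn

class MultiGraphEncoder(nn.Module):
    def __init__(self, n_graphs, dim):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_graphs))
        self.out = nn.Linear(dim, 1)           # next-day movement logit

    def forward(self, x, adjs):                # x: (stocks, dim); adjs: list of (stocks, stocks)
        views = []
        for A, lin in zip(adjs, self.proj):
            A_hat = A / A.sum(-1, keepdim=True).clamp(min=1)   # row-normalize
            views.append(torch.relu(lin(A_hat @ x)))
        h = torch.stack(views).mean(0)         # fuse relation-specific views
        return self.out(h).squeeze(-1)

enc = MultiGraphEncoder(n_graphs=2, dim=16)
x = torch.randn(600, 16)                       # e.g., news-sentiment features
adjs = [torch.rand(600, 600).round() for _ in range(2)]  # toy relation graphs
print(enc(x, adjs).shape)                      # (600,) movement scores
```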

Reasoning | Analysis | Understanding | Explanation (1 paper)

【1】 When a crisis strikes: Emotion analysis and detection during COVID-19

Authors: Alexander Tekle, Chau Pham, Cornelia Caragea, Junyi Jessy Li
Affiliations: Electrical and Computer Engineering, The University of Texas at Austin; Computer Science, Colgate University; University of Illinois at Chicago; Linguistics
Link: https://arxiv.org/abs/2107.11020
Abstract: Crises such as natural disasters, global pandemics, and social unrest continuously threaten our world and emotionally affect millions of people worldwide in distinct ways. Understanding the emotions that people express during large-scale crises helps inform policy makers and first responders about the emotional states of the population, as well as provide emotional support to those who need it. We present CovidEmo, ~1K tweets labeled with emotions. We examine how well large pre-trained language models generalize across domains and crises in the task of perceived emotion prediction in the context of COVID-19. Our results show that existing models do not transfer directly from one disaster type to another, but using labeled emotional corpora for domain adaptation is beneficial.
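The cross-domain setup in miniature: fit an emotion classifier on source-crisis tweets and score it on target-crisis tweets. The paper fine-tunes large pre-trained language models; the TF-IDF + logistic-regression stand-in and toy data below merely keep the sketch self-contained.

```python
# Cross-domain transfer in miniature: train on source-crisis tweets,
# evaluate on target-crisis tweets. Stand-in model and toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

src_texts = ["the flood destroyed our home", "rescue teams arrived, so relieved"]
src_labels = ["fear", "joy"]                   # toy stand-ins for emotion labels
tgt_texts = ["scared of catching the virus", "vaccinated today, so relieved"]
tgt_labels = ["fear", "joy"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(src_texts, src_labels)                 # source-domain training
print("target acc:", clf.score(tgt_texts, tgt_labels))  # cross-domain test
```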

GAN | Adversarial | Attacks | Generation (3 papers)

【1】 A Differentiable Language Model Adversarial Attack on Text Classifiers

Authors: Ivan Fursov, Alexey Zaytsev, Pavel Burnyshev, Ekaterina Dmitrieva, Nikita Klyuchnikov, Andrey Kravchenko, Ekaterina Artemova, Evgeny Burnaev
Affiliations: Skolkovo Institute of Science and Technology; Huawei Noah's Ark Lab; HSE University; Department of Computer Science, Oxford University
Notes: arXiv admin note: substantial text overlap with arXiv:2006.11078
Link: https://arxiv.org/abs/2107.11275
Abstract: The robustness of huge Transformer-based models for natural language processing is an important issue due to their capabilities and wide adoption. One way to understand and improve the robustness of these models is to explore adversarial attack scenarios: check whether a small perturbation of an input can fool a model. Due to the discrete nature of textual data, gradient-based adversarial methods, widely used in computer vision, are not applicable per se. The standard strategy to overcome this issue is to develop token-level transformations, which do not take the whole sentence into account. In this paper, we propose a new black-box sentence-level attack. Our method fine-tunes a pre-trained language model to generate adversarial examples. The proposed differentiable loss function depends on a substitute classifier score and an approximate edit distance computed via a deep learning model. We show that the proposed attack outperforms competitors on a diverse set of NLP problems in terms of both computed metrics and human evaluation. Moreover, due to the use of the fine-tuned language model, the generated adversarial examples are hard to detect; thus, current models are not robust. Hence, it is difficult to defend against the proposed attack, which is not the case for other attacks.
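Conceptually, the attack objective trades off fooling a substitute classifier against staying close to the original text. A hedged PyTorch sketch follows, in which the substitute classifier, the learned edit-distance model, and the relaxed perturbation are toy stand-ins for the paper's components.

```python
# Conceptual attack objective: make the substitute classifier doubt the true
# label while keeping the adversarial text close to the original. Stand-ins
# replace the fine-tuned generator, surrogate classifier, and distance model.
import torch
import torch.nn as nn

substitute_clf = nn.Linear(32, 2)                 # toy surrogate classifier
edit_dist = lambda a, b: (a - b).abs().mean(-1)   # stand-in distance model

def attack_loss(orig_emb, adv_emb, true_label, alpha=1.0):
    # Minimizing the true-label log-probability pushes toward misclassification;
    # the distance term keeps the adversarial example close to the original.
    logits = substitute_clf(adv_emb)
    fool = logits.log_softmax(-1)[torch.arange(len(logits)), true_label]
    return fool.mean() + alpha * edit_dist(orig_emb, adv_emb).mean()

orig = torch.randn(4, 32)                          # original text embeddings
delta = torch.zeros(4, 32, requires_grad=True)     # relaxed perturbation
loss = attack_loss(orig, orig + delta, torch.zeros(4, dtype=torch.long))
loss.backward()                                    # gradients reach the generator
print(loss.item(), delta.grad.norm().item())
```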

【2】 Adversarial Reinforced Instruction Attacker for Robust Vision-Language Navigation

Authors: Bingqian Lin, Yi Zhu, Yanxin Long, Xiaodan Liang, Qixiang Ye, Liang Lin
Link: https://arxiv.org/abs/2107.11252
Abstract: Language instruction plays an essential role in natural language grounded navigation tasks. However, navigators trained with limited human-annotated instructions may have difficulty accurately capturing key information from complicated instructions at different timesteps, leading to poor navigation performance. In this paper, we train a more robust navigator that is capable of dynamically extracting crucial factors from long instructions, using an adversarial attack paradigm. Specifically, we propose a Dynamic Reinforced Instruction Attacker (DR-Attacker), which learns to mislead the navigator into moving to the wrong target by destroying the most instructive information in instructions at different timesteps. By formulating the perturbation generation as a Markov Decision Process, DR-Attacker is optimized by a reinforcement learning algorithm to generate perturbed instructions sequentially during navigation, according to a learnable attack score. The perturbed instructions, which serve as hard samples, are then used to improve the robustness of the navigator with an effective adversarial training strategy and an auxiliary self-supervised reasoning task. Experimental results on both Vision-and-Language Navigation (VLN) and Navigation from Dialog History (NDH) tasks show the superiority of our proposed method over state-of-the-art methods. Moreover, visualization analysis shows the effectiveness of the proposed DR-Attacker, which can successfully attack crucial information in the instructions at different timesteps. Code is available at https://github.com/expectorlin/DR-Attacker.
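The RL formulation can be pictured as follows: a policy scores instruction tokens, samples a position to perturb, and is updated with REINFORCE using the navigator's degradation as reward. In this sketch the token embeddings and the reward are stand-ins, not the paper's implementation.

```python
# REINFORCE update for a toy instruction attacker: score tokens, sample one
# position to perturb, reward = navigator degradation (stand-in here).
import torch
import torch.nn as nn

class Attacker(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.score = nn.Linear(dim, 1)          # learnable attack score

    def forward(self, tok_emb):                 # tok_emb: (seq, dim)
        logits = self.score(tok_emb).squeeze(-1)
        dist = torch.distributions.Categorical(logits=logits)
        idx = dist.sample()                     # token position to perturb
        return idx, dist.log_prob(idx)

attacker = Attacker()
opt = torch.optim.Adam(attacker.parameters(), lr=1e-3)

tok_emb = torch.randn(12, 32)                   # instruction token embeddings
idx, logp = attacker(tok_emb)
reward = torch.rand(())                         # stand-in: navigator's error
(-logp * reward).backward()                     # REINFORCE gradient
opt.step()
```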

【3】 Unsupervised Domain Adaptation for Dysarthric Speech Detection via Domain Adversarial Training and Mutual Information Minimization

Authors: Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng
Affiliations: The Chinese University of Hong Kong, Hong Kong SAR, China; Huawei Noah's Ark Lab
Notes: Accepted to Interspeech 2021
Link: https://arxiv.org/abs/2106.10127
Abstract: Dysarthric speech detection (DSD) systems aim to detect characteristics of the neuromotor disorder from speech. Such systems are particularly susceptible to domain mismatch, where the training and testing data come from source and target domains respectively, but the two domains may differ in terms of speech stimuli, disease etiology, etc. It is hard to acquire labelled data in the target domain due to the high cost of annotating sizeable datasets. This paper makes a first attempt to formulate cross-domain DSD as an unsupervised domain adaptation (UDA) problem. Using labelled source-domain data and unlabelled target-domain data, we propose a multi-task learning strategy, including dysarthria presence classification (DPC), domain adversarial training (DAT), and mutual information minimization (MIM), which aims to learn dysarthria-discriminative and domain-invariant biomarker embeddings. Specifically, DPC helps biomarker embeddings capture critical indicators of dysarthria; DAT forces biomarker embeddings to be indistinguishable between source and target domains; and MIM further reduces the correlation between biomarker embeddings and domain-related cues. Treating the UASPEECH and TORGO corpora respectively as the source and target domains, experiments show that the incorporation of UDA attains absolute increases of 22.2% and 20.0% in utterance-level weighted average recall and speaker-level accuracy, respectively.
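Domain adversarial training is commonly implemented with a gradient reversal layer: the forward pass is the identity, while the backward pass negates gradients so the feature extractor learns to confuse a domain classifier. A standard sketch follows (not the authors' code; the toy encoder and dimensions are assumptions).

```python
# Gradient reversal layer: identity forward, negated gradient backward,
# the standard trick behind domain adversarial training (DAT).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

encoder = nn.Linear(40, 16)           # biomarker-embedding extractor (toy)
domain_clf = nn.Linear(16, 2)         # source vs. target discriminator
feats = encoder(torch.randn(8, 40))
dom_logits = domain_clf(GradReverse.apply(feats, 1.0))
loss = nn.functional.cross_entropy(dom_logits, torch.randint(2, (8,)))
loss.backward()  # encoder gets reversed gradients -> domain-invariant features
```

The reversal strength `lam` is typically annealed over training; the paper may realize the adversary differently.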

Other Neural Networks | Deep Learning | Models | Modeling (3 papers)

【1】 Modelling Latent Translations for Cross-Lingual Transfer

Authors: Edoardo Maria Ponti, Julia Kreutzer, Ivan Vulić, Siva Reddy
Affiliations: Mila – Quebec Artificial Intelligence Institute; McGill University; Google Research; University of Cambridge; Facebook CIFAR AI Chair
Link: https://arxiv.org/abs/2107.11353
Abstract: While achieving state-of-the-art results on multiple tasks and languages, translation-based cross-lingual transfer is often overlooked in favour of massively multilingual pre-trained encoders. Arguably, this is due to its main limitations: 1) translation errors percolating to the classification phase and 2) the insufficient expressiveness of the maximum-likelihood translation. To remedy this, we propose a new technique that integrates both steps of the traditional pipeline (translation and classification) into a single model, by treating the intermediate translations as a latent random variable. As a result, 1) the neural machine translation system can be fine-tuned with a variant of Minimum Risk Training where the reward is the accuracy of the downstream task classifier, and 2) multiple samples can be drawn to approximate the expected loss across all possible translations during inference. We evaluate our novel latent-translation-based model on a series of multilingual NLU tasks, including commonsense reasoning, paraphrase identification, and natural language inference. We report gains for both zero-shot and few-shot learning setups, up to 2.7 accuracy points on average, which are even more prominent for low-resource languages (e.g., Haitian Creole). Finally, we carry out in-depth analyses comparing different underlying NMT models and assessing the impact of alternative translations on downstream performance.
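A hedged sketch of the Minimum Risk Training idea: renormalize the probabilities of a few sampled translations and minimize the negative expected task reward. The sample log-likelihoods and rewards below are stand-ins for the NMT sampler and the downstream classifier's accuracy.

```python
# Minimum-risk-style update: expected task reward over sampled translations.
# The log-likelihoods and rewards are stand-ins for real model outputs.
import torch

def mrt_loss(log_probs, rewards, temperature=1.0):
    """log_probs: (k,) sample log-likelihoods; rewards: (k,) task scores."""
    q = torch.softmax(log_probs / temperature, dim=0)  # renormalize over samples
    return -(q * rewards).sum()                        # negative expected reward

log_probs = torch.tensor([-3.2, -4.1, -5.0], requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 1.0])                # was the classifier right?
loss = mrt_loss(log_probs, rewards)
loss.backward()            # shifts probability mass toward high-reward samples
print(loss.item())
```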

【2】 Improving Early Sepsis Prediction with Multi Modal Learning

Authors: Fred Qin, Vivek Madan, Ujjwal Ratan, Zohar Karnin, Vishaal Kapoor, Parminder Bhatia, Taha Kass-Hout
Link: https://arxiv.org/abs/2107.11094
Abstract: Sepsis is a life-threatening disease with high morbidity, mortality, and healthcare costs. Early prediction and administration of antibiotics and intravenous fluids is considered crucial for the treatment of sepsis and could save potentially millions of lives and billions in healthcare costs. Professional clinical care practitioners have proposed clinical criteria that aid in the early detection of sepsis; however, the performance of these criteria is often limited. Clinical text provides essential information for estimating the severity of sepsis in addition to structured clinical data. In this study, we explore how clinical text can complement structured data for the early sepsis prediction task. We propose a multimodal model that incorporates both structured data, in the form of patient measurements, and textual notes on the patient. We employ state-of-the-art NLP models such as BERT and a highly specialized NLP model in Amazon Comprehend Medical to represent the text. On the MIMIC-III dataset containing records of ICU admissions, we show that by using these notes, one achieves an improvement of 6.07 points in a standard utility score for sepsis prediction and 2.89% in AUROC score. Our method significantly outperforms clinical criteria suggested by experts, qSOFA, as well as the winning model of the PhysioNet Computing in Cardiology Challenge for predicting sepsis.
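The fusion itself can be as simple as concatenating a note embedding with structured measurements before a classification head; a minimal sketch follows, where the feature dimensions and pooling choice are assumptions rather than the paper's exact architecture.

```python
# Late fusion of a clinical-note embedding with structured vitals/labs.
# Dimensions and the BERT pooling choice are illustrative assumptions.
import torch
import torch.nn as nn

class SepsisFusion(nn.Module):
    def __init__(self, text_dim=768, struct_dim=40, hidden=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + struct_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))               # sepsis-risk logit

    def forward(self, text_emb, struct_feats):
        return self.head(torch.cat([text_emb, struct_feats], dim=-1))

model = SepsisFusion()
text_emb = torch.randn(2, 768)                  # e.g., BERT [CLS] embeddings
vitals = torch.randn(2, 40)                     # structured measurements
print(torch.sigmoid(model(text_emb, vitals)))   # per-patient risk
```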

【3】 Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion

Authors: Disong Wang, Songxiang Liu, Lifa Sun, Xixin Wu, Xunying Liu, Helen Meng
Affiliations: Human-Computer Communications Laboratory, The Chinese University of Hong Kong, Hong Kong SAR, China; SpeechX Limited, Shenzhen, China; Department of Engineering, University of Cambridge, UK
Notes: Accepted to Interspeech 2021
Link: https://arxiv.org/abs/2011.01678
Abstract: Though significant progress has been made in voice conversion (VC) for typical speech, VC for atypical speech, e.g., dysarthric and second-language (L2) speech, remains a challenge, since it involves correcting atypical prosody while maintaining speaker identity. To address this issue, we propose a VC system with explicit prosodic modelling and deep speaker embedding (DSE) learning. First, a speech encoder strives to extract robust phoneme embeddings from atypical speech. Second, a prosody corrector takes in the phoneme embeddings to infer typical phoneme duration and pitch values. Third, a conversion model takes the phoneme embeddings and typical prosody features as inputs to generate the converted speech, conditioned on the target DSE, which is learned via a speaker encoder or speaker adaptation. Extensive experiments demonstrate that speaker adaptation can achieve higher speaker similarity, and the speaker-encoder-based conversion model can greatly reduce dysarthric and non-native pronunciation patterns with improved speech intelligibility. A comparison of speech recognition results between the original dysarthric speech and the converted speech shows that absolute reductions of 47.6% in character error rate (CER) and 29.3% in word error rate (WER) can be achieved.
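A sketch of the prosody-corrector interface: map phoneme embeddings to typical per-phoneme duration and pitch targets. The recurrent backbone and layer sizes are illustrative assumptions, not the paper's architecture.

```python
# Prosody corrector in miniature: predict typical per-phoneme duration and
# pitch from phoneme embeddings. Backbone and sizes are illustrative.
import torch
import torch.nn as nn

class ProsodyCorrector(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.duration = nn.Linear(2 * dim, 1)    # frames per phoneme
        self.pitch = nn.Linear(2 * dim, 1)       # log-F0 per phoneme

    def forward(self, phoneme_emb):              # (batch, n_phonemes, dim)
        h, _ = self.rnn(phoneme_emb)
        return self.duration(h).squeeze(-1), self.pitch(h).squeeze(-1)

corrector = ProsodyCorrector()
dur, f0 = corrector(torch.randn(1, 20, 256))
print(dur.shape, f0.shape)                       # typical prosody targets
```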

Other (2 papers)

【1】 OLR 2021 Challenge: Datasets, Rules and Baselines

Authors: Binling Wang, Wenxuan Hu, Jing Li, Yiming Zhi, Zheng Li, Qingyang Hong, Lin Li, Dong Wang, Liming Song, Cheng Yang
Affiliations: School of Informatics, Xiamen University; School of Electronic Science and Engineering, Xiamen University; Center for Speech and Language Technologies, Tsinghua University; Beijing National Research Center for Information Science and Technology
Notes: arXiv admin note: text overlap with arXiv:2006.03473, arXiv:1907.07626, arXiv:1806.00616, arXiv:1706.09742
Link: https://arxiv.org/abs/2107.11113
Abstract: This paper introduces the sixth Oriental Language Recognition (OLR) 2021 Challenge, which aims to improve the performance of language recognition systems and speech recognition systems in multilingual scenarios. The data profile, four tasks, two baselines, and the evaluation principles are introduced in this paper. In addition to the Language Identification (LID) tasks, multilingual Automatic Speech Recognition (ASR) tasks are introduced to the OLR 2021 Challenge for the first time. This year's challenge focuses on more practical and challenging problems, with four tasks: (1) constrained LID, (2) unconstrained LID, (3) constrained multilingual ASR, (4) unconstrained multilingual ASR. Baselines for the LID tasks and the multilingual ASR tasks are provided, respectively. The LID baseline system is an extended TDNN x-vector model constructed with PyTorch. A Transformer-based end-to-end model is provided as the multilingual ASR baseline system. These recipes will be published online and made available for participants to construct their own LID or ASR systems. The baseline results demonstrate that those tasks are rather challenging and deserve more effort to achieve better performance.
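For orientation, the x-vector recipe stacks dilated 1-D convolutions (TDNN) over frame-level features, pools statistics over time, and classifies at the segment level. A compact PyTorch sketch follows; layer sizes and the language count are illustrative, not the challenge baseline's exact configuration.

```python
# Compact TDNN x-vector sketch: frame-level dilated convs, statistics
# pooling over time, then a language classifier. Sizes are illustrative.
import torch
import torch.nn as nn

class XVectorLID(nn.Module):
    def __init__(self, feat_dim=40, n_langs=13):
        super().__init__()
        self.frame = nn.Sequential(
            nn.Conv1d(feat_dim, 512, 5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, 3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, 3, dilation=3), nn.ReLU())
        self.segment = nn.Sequential(
            nn.Linear(2 * 512, 512), nn.ReLU(),   # x-vector layer
            nn.Linear(512, n_langs))

    def forward(self, feats):                      # (batch, feat_dim, frames)
        h = self.frame(feats)
        stats = torch.cat([h.mean(-1), h.std(-1)], dim=-1)  # stats pooling
        return self.segment(stats)

lid = XVectorLID()
print(lid(torch.randn(2, 40, 300)).shape)          # (2, n_langs) logits
```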

【2】 FNetAR: Mixing Tokens with Autoregressive Fourier Transforms

Authors: Tim Lou, Michael Park, Mohammad Ramezanali, Vincent Tang
Affiliations: X-Mechanics, Cresskill, NJ; LiveRamp, San Francisco, CA; Appliedinfo Partners, Somerset, NJ; Salesforce, San Francisco, CA; SamsungNEXT, New York, NY
Notes: final experimental results forthcoming
Link: https://arxiv.org/abs/2107.10932
Abstract: In this note we examine the autoregressive generalization of the FNet algorithm, in which self-attention layers from the standard Transformer architecture are substituted with a trivial sparse uniform-sampling procedure based on Fourier transforms. Using the Wikitext-103 benchmark, we demonstrate that FNetAR retains state-of-the-art performance (25.8 ppl) on the task of causal language modeling compared to a Transformer-XL baseline (24.2 ppl) with only half the number of self-attention layers, thus providing further evidence for the superfluity of deep neural networks with heavily compounded attention mechanisms. The autoregressive Fourier transform could likely be used for parameter reduction on most Transformer-based time-series prediction models.
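The note does not spell out FNetAR's exact sampling procedure, so the sketch below shows just one plausible way to make Fourier-style token mixing causal: mask a real DFT mixing matrix to be lower-triangular so position t never sees later positions.

```python
# Causal token mixing with a masked real-DFT matrix: an FNet-style mixing
# layer that never lets position t attend to positions > t. One plausible
# construction only; not FNetAR's specified procedure.
import math
import torch
import torch.nn as nn

class CausalFourierMix(nn.Module):
    def __init__(self, max_len):
        super().__init__()
        n = torch.arange(max_len)
        dft = torch.cos(2 * math.pi * n[:, None] * n[None, :] / max_len)
        mask = torch.tril(torch.ones(max_len, max_len))
        self.register_buffer("mix", dft * mask)    # causal (unnormalized) mixing matrix

    def forward(self, x):                          # x: (batch, seq, dim)
        L = x.size(1)
        return self.mix[:L, :L] @ x                # mix along the sequence axis

layer = CausalFourierMix(max_len=128)
x = torch.randn(2, 16, 64)
print(layer(x).shape)                              # (2, 16, 64), causally mixed
```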
