Natural Language Processing Academic Digest [10.18]

2021-10-21 16:09:35


cs.CL: 65 papers in total today

Transformer (3 papers)

【1】 StreaMulT: Streaming Multimodal Transformer for Heterogeneous and Arbitrary Long Sequential Data. Link: https://arxiv.org/abs/2110.08021

Authors: Victor Pellegrain, Myriam Tami, Michel Batteux, Céline Hudelot. Affiliations: Institut de Recherche Technologique SystemX, boulevard Thomas Gobert, Palaiseau, France; Université Paris-Saclay, CentraleSupélec, MICS, Gif-sur-Yvette, France. Note: 5 pages, 4 figures, submitted to ICASSP 2022. Abstract: This paper tackles the problem of efficiently processing and combining arbitrarily long data streams coming from different modalities with different acquisition frequencies. Common applications include, for instance, long-term monitoring of industrial or real-life systems from multimodal heterogeneous data (sensor data, monitoring reports, images, etc.). To tackle this problem, we propose StreaMulT, a Streaming Multimodal Transformer relying on cross-modal attention and an augmented memory bank to process arbitrarily long input sequences at training time and run in a streaming way at inference. StreaMulT reproduces state-of-the-art results on the CMU-MOSEI dataset while being able to deal with much longer inputs than other models such as the previous Multimodal Transformer.

【2】 Crisis Domain Adaptation Using Sequence-to-sequence Transformers. Link: https://arxiv.org/abs/2110.08015

Authors: Congcong Wang, Paul Nulty, David Lillis. Affiliations: School of Computer Science, University College Dublin. Note: 18th International Conference on Information Systems for Crisis Response and Management (ISCRAM 2021). Abstract: User-generated content (UGC) on social media can act as a key source of information for emergency responders in crisis situations. However, due to the volume concerned, computational techniques are needed to effectively filter and prioritise this content as it arises during emerging events. In the literature, these techniques are trained using annotated content from previous crises. In this paper, we investigate how this prior knowledge can be best leveraged for new crises by examining the extent to which crisis events of a similar type are more suitable for adaptation to new events (cross-domain adaptation). Given the recent successes of transformers in various language processing tasks, we propose CAST: an approach for Crisis domain Adaptation leveraging Sequence-to-sequence Transformers. We evaluate CAST using two major crisis-related message classification datasets. Our experiments show that our best CAST-based run, without using any target data, achieves state-of-the-art performance in both in-domain and cross-domain contexts. Moreover, CAST is particularly effective in one-to-one cross-domain adaptation when trained with a larger language model. In many-to-one adaptation, where multiple crises are jointly used as the source domain, CAST further improves its performance. In addition, we find that more similar events are more likely to bring better adaptation performance, whereas fine-tuning using dissimilar events does not help adaptation. To aid reproducibility, we open-source our code to the community.

【3】 Transformer-based Multi-task Learning for Disaster Tweet Categorisation. Link: https://arxiv.org/abs/2110.08010

Authors: Congcong Wang, Paul Nulty, David Lillis. Affiliations: School of Computer Science, University College Dublin. Note: 18th International Conference on Information Systems for Crisis Response and Management (ISCRAM 2021). Abstract: Social media has enabled people to circulate information in a timely fashion, thus motivating people to post messages seeking help during crisis situations. These messages can contribute to the situational awareness of emergency responders, who have a need for them to be categorised according to information types (i.e. the type of aid services the messages are requesting). We introduce a transformer-based multi-task learning (MTL) technique for classifying information types and estimating the priority of these messages. We evaluate the effectiveness of our approach with a variety of metrics by submitting runs to the TREC Incident Streams (IS) track: a research initiative specifically designed for disaster tweet classification and prioritisation. The results demonstrate that our approach achieves competitive performance in most metrics as compared to other participating runs. Subsequently, we find that an ensemble approach combining disparate transformer encoders within our approach helps to improve the overall effectiveness to a significant extent, achieving state-of-the-art performance in almost every metric. We make the code publicly available so that our work can be reproduced and used as a baseline for the community for future work in this domain.

QA | VQA | Question Answering | Dialogue (7 papers)

【1】 BBQ: A Hand-Built Bias Benchmark for Question Answering. Link: https://arxiv.org/abs/2110.08193

Authors: Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, Samuel R. Bowman. Affiliations: New York University, Dept. of Linguistics, Center for Data Science, Dept. of Computer Science. Note: 16 pages, 9 figures. Abstract: It is well documented that NLP models learn social biases present in the world, but little work has been done to show how these biases manifest in actual model outputs for applied tasks like question answering (QA). We introduce the Bias Benchmark for QA (BBQ), a dataset consisting of question-sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine different social dimensions relevant for U.S. English-speaking contexts. Our task evaluates model responses at two distinct levels: (i) given an under-informative context, test how strongly model answers reflect social biases, and (ii) given an adequately informative context, test whether the model's biases still override a correct answer choice. We find that models strongly rely on stereotypes when the context is ambiguous, meaning that the model's outputs consistently reproduce harmful biases in this setting. Though models are much more accurate when the context provides an unambiguous answer, they still rely on stereotyped information and achieve an accuracy 2.5 percentage points higher on examples where the correct answer aligns with a social bias, with this accuracy difference widening to 5 points for examples targeting gender.

【2】 MixQG: Neural Question Generation with Mixed Answer Types. Link: https://arxiv.org/abs/2110.08175

Authors: Lidiya Murakhovs'ka, Chien-Sheng Wu, Tong Niu, Wenhao Liu, Caiming Xiong. Affiliations: Salesforce AI Research. Abstract: Asking good questions is an essential ability for both human and machine intelligence. However, existing neural question generation approaches mainly focus on the short factoid type of answers. In this paper, we propose a neural question generator, MixQG, to bridge this gap. We combine 9 question answering datasets with diverse answer types, including yes/no, multiple-choice, extractive, and abstractive answers, to train a single generative model. We show with empirical results that our model outperforms existing work in both seen and unseen domains and can generate questions with different cognitive levels when conditioned on different answer types. Our code is released and well-integrated with the Huggingface library to facilitate various downstream applications.

【3】 Few-Shot Bot: Prompt-Based Learning for Dialogue Systems. Link: https://arxiv.org/abs/2110.08118

Authors: Andrea Madotto, Zhaojiang Lin, Genta Indra Winata, Pascale Fung. Affiliations: Department of Electronics and Computer Engineering, The Hong Kong University of Science and Technology. Abstract: Learning to converse using only a few examples is a great challenge in conversational AI. The current best conversational models, which are either good chit-chatters (e.g., BlenderBot) or goal-oriented systems (e.g., MinTL), are language models (LMs) fine-tuned on large conversational datasets. Training these models is expensive, both in terms of computational resources and time, and it is hard to keep them up to date with new conversational skills. A simple yet unexplored solution is prompt-based few-shot learning (Brown et al. 2020), which does not require gradient-based fine-tuning but instead uses a few examples in the LM context as the only source of learning. In this paper, we explore prompt-based few-shot learning in dialogue tasks. We benchmark LMs of different sizes on nine response generation tasks, which include four knowledge-grounded tasks, a task-oriented generation task, three open-chat tasks, and controlled stylistic generation, and five conversational parsing tasks, which include dialogue state tracking, graph path generation, persona information extraction, document retrieval, and internet query generation. The current largest released LM (GPT-J-6B) using prompt-based few-shot learning, and thus requiring no training, achieves competitive performance to fully trained state-of-the-art models. Moreover, we propose a novel prompt-based few-shot classifier, which also does not require any fine-tuning, to select the most appropriate prompt given a dialogue history. Finally, by combining the power of prompt-based few-shot learning and a Skill Selector, we create an end-to-end chatbot named the Few-Shot Bot (FSB), which automatically selects the most appropriate conversational skill, queries different knowledge bases or the internet, and uses the retrieved knowledge to generate a human-like response, all using only a few dialogue examples per skill.
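
A minimal sketch of the prompt-based few-shot idea described above, using the Hugging Face transformers library: a handful of in-context dialogue exchanges plus the current history are concatenated into a prompt and the LM simply continues it. The prompt format and the example shots are illustrative assumptions, not the paper's exact templates.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# A few in-context dialogue examples act as the only "training" signal.
shots = [
    ("User: What's a good sci-fi book?", "Bot: You might enjoy 'Dune' by Frank Herbert."),
    ("User: I loved it, any similar ones?", "Bot: 'Hyperion' by Dan Simmons has a similar epic scope."),
]

def build_prompt(history, shots):
    """Concatenate few-shot examples and the current dialogue history into one prompt."""
    lines = []
    for user, bot in shots:
        lines += [user, bot]
    lines.append(history)   # e.g. "User: Recommend a fantasy novel."
    lines.append("Bot:")    # the model continues from here
    return "\n".join(lines)

# GPT-J-6B as in the abstract; a smaller causal LM can be substituted for quick tests.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

prompt = build_prompt("User: Recommend a fantasy novel.", shots)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```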

【4】 Multimodal Emotion-Cause Pair Extraction in Conversations. Link: https://arxiv.org/abs/2110.08020

Authors: Fanfan Wang, Zixiang Ding, Rui Xia, Zhaoyu Li, Jianfei Yu. Affiliations: School of Computer Science and Engineering, Nanjing University of Science and Technology, China. Abstract: Emotion cause analysis has received considerable attention in recent years. Previous studies primarily focused on emotion cause extraction from texts in news articles or microblogs. It is also interesting to discover emotions and their causes in conversations. As conversation in its natural form is multimodal, a large number of studies have been carried out on multimodal emotion recognition in conversations, but there is still a lack of work on multimodal emotion cause analysis. In this work, we introduce a new task named Multimodal Emotion-Cause Pair Extraction in Conversations, aiming to jointly extract emotions and their associated causes from conversations reflected in multiple modalities (text, audio and video). We accordingly construct a multimodal conversational emotion cause dataset, Emotion-Cause-in-Friends, which contains 9,272 multimodal emotion-cause pairs annotated on 13,509 utterances in the sitcom Friends. We finally benchmark the task by establishing a baseline system that incorporates multimodal features for emotion-cause pair extraction. Preliminary experimental results demonstrate the potential of multimodal information fusion for discovering both emotions and causes in conversations.

【5】 ContraQA: Question Answering under Contradicting Contexts. Link: https://arxiv.org/abs/2110.07803

Authors: Liangming Pan, Wenhu Chen, Min-Yen Kan, William Yang Wang. Affiliations: National University of Singapore, Singapore; University of California, Santa Barbara, CA, USA. Note: Technical report. Abstract: With a rise in false, inaccurate, and misleading information in propaganda, news, and social media, real-world Question Answering (QA) systems face the challenges of synthesizing and reasoning over contradicting information to derive correct answers. This urgency gives rise to the need to make QA systems robust to misinformation, a topic previously unexplored. We study the risk of misinformation to QA models by investigating the behavior of the QA model under contradicting contexts that are mixed with both real and fake information. We create the first large-scale dataset for this problem, namely Contra-QA, which contains over 10K human-written and model-generated contradicting pairs of contexts. Experiments show that QA models are vulnerable under contradicting contexts brought by misinformation. To defend against such a threat, we build a misinformation-aware QA system as a counter-measure that integrates question answering and misinformation detection in a joint fashion.

【6】 CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training. Link: https://arxiv.org/abs/2110.07731

Authors: Patrick Huber, Armen Aghajanyan, Barlas Oğuz, Dmytro Okhonko, Wen-tau Yih, Sonal Gupta, Xilun Chen. Affiliations: University of British Columbia; Facebook Inc. Abstract: With the rise of large-scale pre-trained language models, open-domain question-answering (ODQA) has become an important research topic in NLP. Based on the popular pre-training fine-tuning approach, we posit that an additional in-domain pre-training stage using a large-scale, natural, and diverse question-answering (QA) dataset can be beneficial for ODQA. Consequently, we propose a novel QA dataset based on the Common Crawl project in this paper. Using the readily available schema.org annotation, we extract around 130 million multilingual question-answer pairs, including about 60 million English data-points. With this previously unseen number of natural QA pairs, we pre-train popular language models to show the potential of large-scale in-domain pre-training for the task of question-answering. In our experiments, we find that pre-training question-answering models on our Common Crawl Question Answering dataset (CCQA) achieves promising results in zero-shot, low-resource and fine-tuned settings across multiple tasks, models and benchmarks.
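
A rough sketch of how question-answer pairs can be pulled from schema.org FAQPage markup (JSON-LD) embedded in web pages, which is the kind of annotation the abstract refers to. The exact Common Crawl pipeline, filtering, and additional schema.org types used for CCQA are not shown, and the `html` argument is a placeholder for a crawled page.

```python
import json
import re

JSONLD_RE = re.compile(r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>', re.S)

def extract_qa_pairs(html):
    """Extract (question, answer) pairs from schema.org FAQPage JSON-LD blocks."""
    pairs = []
    for block in JSONLD_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if item.get("@type") != "FAQPage":
                continue
            for q in item.get("mainEntity", []):
                question = q.get("name", "")
                answer = (q.get("acceptedAnswer") or {}).get("text", "")
                if question and answer:
                    pairs.append((question, answer))
    return pairs
```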

【7】 GlobalWoZ: Globalizing MultiWoZ to Develop Multilingual Task-Oriented Dialogue Systems. Link: https://arxiv.org/abs/2110.07679

Authors: Bosheng Ding, Junjie Hu, Lidong Bing, Sharifah Mahani Aljunied, Shafiq Joty, Luo Si, Chunyan Miao. Affiliations: Nanyang Technological University, Singapore; DAMO Academy, Alibaba Group; University of Wisconsin-Madison. Abstract: Much recent progress in task-oriented dialogue (ToD) systems has been driven by available annotation data across multiple domains for training. Over the last few years, there has been a move towards data curation for multilingual ToD systems that are applicable to serve people speaking different languages. However, existing multilingual ToD datasets either have a limited coverage of languages due to the high cost of data curation, or ignore the fact that dialogue entities barely exist in countries speaking these languages. To tackle these limitations, we introduce a novel data curation method that generates GlobalWoZ -- a large-scale multilingual ToD dataset globalized from an English ToD dataset for three unexplored use cases. Our method is based on translating dialogue templates and filling them with local entities in the target-language countries. We release our dataset as well as a set of strong baselines to encourage research on learning multilingual ToD systems for real use cases.

Machine Translation (7 papers)

【1】 Direct simultaneous speech to speech translation. Link: https://arxiv.org/abs/2110.08250

Authors: Xutai Ma, Hongyu Gong, Danni Liu, Ann Lee, Yun Tang, Peng-Jen Chen, Wei-Ning Hsu, Kenneth Heafield, Phillip Koehn, Juan Pino. Affiliations: Facebook AI, Johns Hopkins University, Maastricht University, University of Edinburgh. Abstract: We present the first direct simultaneous speech-to-speech translation (Simul-S2ST) model, with the ability to start generating translation in the target speech before consuming the full source speech content and independently from intermediate text representations. Our approach leverages recent progress on direct speech-to-speech translation with discrete units. Instead of continuous spectrogram features, a sequence of direct representations, which are learned in an unsupervised manner, is predicted from the model and passed directly to a vocoder for speech synthesis. The simultaneous policy then operates on source speech features and target discrete units. Finally, a vocoder synthesizes the target speech from discrete units on-the-fly. We carry out numerical studies to compare the cascaded and direct approaches on the Fisher Spanish-English dataset.

【2】 Tricks for Training Sparse Translation Models. Link: https://arxiv.org/abs/2110.08246

Authors: Dheeru Dua, Shruti Bhosale, Vedanuj Goswami, James Cross, Mike Lewis, Angela Fan. Affiliations: University of California, Irvine, USA; Facebook AI. Abstract: Multi-task learning with an unbalanced data distribution skews model learning towards high resource tasks, especially when model capacity is fixed and fully shared across all tasks. Sparse scaling architectures, such as BASELayers, provide flexible mechanisms for different tasks to have a variable number of parameters, which can be useful to counterbalance skewed data distributions. We find that sparse architectures for multilingual machine translation can perform poorly out of the box, and propose two straightforward techniques to mitigate this - a temperature heating mechanism and dense pre-training. Overall, these methods improve performance on two multilingual translation benchmarks compared to standard BASELayers and Dense scaling baselines, and in combination, more than 2x model convergence speed.
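
A sketch of temperature-based sampling over language pairs with a "heating" schedule, i.e. gradually moving from proportional sampling toward a flatter distribution over training. The schedule, constants, and function names are assumptions inferred from the abstract, not the authors' exact recipe.

```python
import numpy as np

def sampling_probs(pair_sizes, temperature):
    """p_l proportional to (n_l / N) ** (1 / T): T=1 is proportional, larger T flattens toward uniform."""
    sizes = np.array(list(pair_sizes.values()), dtype=float)
    probs = (sizes / sizes.sum()) ** (1.0 / temperature)
    return dict(zip(pair_sizes, probs / probs.sum()))

def heated_temperature(step, total_steps, t_start=1.0, t_end=5.0):
    """Linearly 'heat' the temperature so low-resource pairs are sampled more often later in training."""
    frac = min(step / total_steps, 1.0)
    return t_start + frac * (t_end - t_start)

pair_sizes = {"en-fr": 40_000_000, "en-is": 1_200_000, "en-ha": 500_000}
for step in (0, 50_000, 100_000):
    T = heated_temperature(step, total_steps=100_000)
    print(step, {k: round(v, 3) for k, v in sampling_probs(pair_sizes, T).items()})
```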

【3】 Incremental Speech Synthesis For Speech-To-Speech Translation. Link: https://arxiv.org/abs/2110.08214

Authors: Danni Liu, Changhan Wang, Hongyu Gong, Xutai Ma, Yun Tang, Juan Pino. Affiliations: Maastricht University, Facebook AI, Johns Hopkins University. Note: Work in progress. Abstract: In a speech-to-speech translation (S2ST) pipeline, the text-to-speech (TTS) module is an important component for delivering the translated speech to users. To enable incremental S2ST, the TTS module must be capable of synthesizing and playing utterances while its input text is still streaming in. In this work, we focus on improving the incremental synthesis performance of TTS models. With a simple data augmentation strategy based on prefixes, we are able to improve the incremental TTS quality to approach offline performance. Furthermore, we bring our incremental TTS system to the practical scenario in combination with an upstream simultaneous speech translation system, and show the gains also carry over to this use-case. In addition, we propose latency metrics tailored to S2ST applications, and investigate methods for latency reduction in this context.

【4】 Why don't people use character-level machine translation? Link: https://arxiv.org/abs/2110.08191

Authors: Jindřich Libovický, Helmut Schmid, Alexander Fraser. Affiliations: Center for Information and Speech Processing, LMU Munich, Munich, Germany. Note: 16 pages, 4 figures. Abstract: We present a literature and empirical survey that critically assesses the state of the art in character-level modeling for machine translation (MT). Despite evidence in the literature that character-level systems are comparable with subword systems, they are virtually never used in competitive setups in WMT competitions. We empirically show that even with recent modeling innovations in character-level natural language processing, character-level MT systems still struggle to match their subword-based counterparts both in terms of translation quality and training and inference speed. Character-level MT systems show neither better domain robustness, nor better morphological generalization, despite being often so motivated. On the other hand, they tend to be more robust towards source side noise and the translation quality does not degrade with increasing beam size at decoding time.

【5】 Breaking Down Multilingual Machine Translation. Link: https://arxiv.org/abs/2110.08130

Authors: Ting-Rui Chiang, Yi-Pei Chen, Yi-Ting Yeh, Graham Neubig. Affiliations: Carnegie Mellon University, The University of Tokyo. Abstract: While multilingual training is now an essential ingredient in machine translation (MT) systems, recent work has demonstrated that it has different effects in different multilingual settings, such as many-to-one, one-to-many, and many-to-many learning. These training settings expose the encoder and the decoder in a machine translation model with different data distributions. In this paper, we examine how different varieties of multilingual training contribute to learning these two components of the MT model. Specifically, we compare bilingual models with encoders and/or decoders initialized by multilingual training. We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs). We further find the important attention heads for each language pair and compare their correlations during inference. Our analysis sheds light on how multilingual translation models work and also enables us to propose methods to improve performance by training with highly related languages. Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al. (2019).

【6】 Multilingual Neural Machine Translation: Can Linguistic Hierarchies Help? Link: https://arxiv.org/abs/2110.07816

Authors: Fahimeh Saleh, Wray Buntine, Gholamreza Haffari, Lan Du. Affiliations: Monash University. Abstract: Multilingual Neural Machine Translation (MNMT) trains a single NMT model that supports translation between multiple languages, rather than training separate models for different languages. Learning a single model can enhance the low-resource translation by leveraging data from multiple languages. However, the performance of an MNMT model is highly dependent on the type of languages used in training, as transferring knowledge from a diverse set of languages degrades the translation performance due to negative transfer. In this paper, we propose a Hierarchical Knowledge Distillation (HKD) approach for MNMT which capitalises on language groups generated according to typological features and phylogeny of languages to overcome the issue of negative transfer. HKD generates a set of multilingual teacher-assistant models via a selective knowledge distillation mechanism based on the language groups, and then distils the ultimate multilingual model from those assistants in an adaptive way. Experimental results derived from the TED dataset with 53 languages demonstrate the effectiveness of our approach in avoiding the negative transfer effect in MNMT, leading to an improved translation performance (about 1 BLEU score on average) compared to strong baselines.

【7】 Alternative Input Signals Ease Transfer in Multilingual Machine Translation. Link: https://arxiv.org/abs/2110.07804

Authors: Simeng Sun, Angela Fan, James Cross, Vishrav Chaudhary, Chau Tran, Philipp Koehn, Francisco Guzman. Abstract: Recent work in multilingual machine translation (MMT) has focused on the potential of positive transfer between languages, particularly cases where higher-resourced languages can benefit lower-resourced ones. While training an MMT model, the supervision signals learned from one language pair can be transferred to the other via the tokens shared by multiple source languages. However, the transfer is inhibited when the token overlap among source languages is small, which manifests naturally when languages use different writing systems. In this paper, we tackle inhibited transfer by augmenting the training data with alternative signals that unify different writing systems, such as phonetic, romanized, and transliterated input. We test these signals on Indic and Turkic languages, two language families where the writing systems differ but languages still share common features. Our results indicate that a straightforward multi-source self-ensemble -- training a model on a mixture of various signals and ensembling the outputs of the same model fed with different signals during inference -- outperforms strong ensemble baselines by 1.3 BLEU points on both language families. Further, we find that incorporating alternative inputs via self-ensemble can be particularly effective when the training set is small, leading to +5 BLEU when only 5% of the total training data is accessible. Finally, our analysis demonstrates that including alternative signals yields more consistency and translates named entities more accurately, which is crucial for increased factuality of automated systems.
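
A simplified sketch of the self-ensemble idea: the same model is fed several views of the source (original script, romanized, transliterated) and the next-token distributions are averaged at each decoding step. Greedy decoding is used here for brevity, and the romanization/transliteration helpers in the comment are hypothetical.

```python
import torch

@torch.no_grad()
def self_ensemble_decode(model, tokenizer, views, max_len=64):
    """Average next-token probabilities of one seq2seq model over alternative input signals."""
    encs = [tokenizer(v, return_tensors="pt") for v in views]
    ys = torch.tensor([[model.config.decoder_start_token_id]])
    for _ in range(max_len):
        probs = torch.stack([
            torch.softmax(model(**enc, decoder_input_ids=ys).logits[:, -1], dim=-1)
            for enc in encs
        ]).mean(dim=0)                      # ensemble over the input views
        next_id = probs.argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ys[0], skip_special_tokens=True)

# views = [source_text, romanize(source_text), transliterate(source_text)]  # hypothetical helpers
```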

Semantic Analysis (3 papers)

【1】 Hierarchical Curriculum Learning for AMR Parsing. Link: https://arxiv.org/abs/2110.07855

Authors: Peiyi Wang, Liang Chen, Tianyu Liu, Baobao Chang, Zhifang Sui. Affiliations: Key Laboratory of Computational Linguistics, Peking University, MOE, China; Tencent Cloud Xiaowei. Abstract: Abstract Meaning Representation (AMR) parsing translates sentences to the semantic representation with a hierarchical structure, which is recently empowered by pretrained encoder-decoder models. However, the flat sentence-to-AMR training paradigm impedes the representation learning of concepts and relations in the deeper AMR sub-graph. To make the sequence-to-sequence models better adapt to the inherent AMR structure, we propose a hierarchical curriculum learning (HCL) approach which consists of (1) structure-level curriculum (SC) and (2) instance-level curriculum (IC). SC switches progressively from shallow to deep AMR sub-graphs while IC transits from easy to hard AMR instances during training. Extensive experiments show that BART trained with HCL achieves state-of-the-art performance on the AMR-2.0 and AMR-3.0 benchmarks, and significantly outperforms baselines on the structure-dependent evaluation metrics and hard instances.
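
A sketch of how the two curricula could be scheduled: the structure-level curriculum caps the depth of the target AMR sub-graph while the instance-level curriculum grows the training pool from easy to hard examples. The difficulty proxy (node count), pacing function, and the `truncate_graph` helper are assumptions, not the paper's exact settings.

```python
def curriculum_batches(examples, num_epochs, max_depth):
    """Yield per-epoch training subsets that grow from shallow/easy to deep/hard."""
    # Instance-level curriculum: sort once by a difficulty proxy (e.g. number of AMR nodes).
    examples = sorted(examples, key=lambda ex: ex["num_nodes"])
    for epoch in range(num_epochs):
        pace = (epoch + 1) / num_epochs
        depth_cap = max(1, int(pace * max_depth))              # structure-level: allowed sub-graph depth
        keep = examples[: max(1, int(pace * len(examples)))]   # instance-level: easy -> hard
        yield [truncate_graph(ex, depth_cap) for ex in keep]

def truncate_graph(ex, depth_cap):
    """Hypothetical helper: keep only AMR nodes up to depth_cap in this epoch's target graph."""
    ex = dict(ex)
    ex["amr"] = [node for node in ex["amr"] if node["depth"] <= depth_cap]
    return ex
```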

【2】 End-to-End Segmentation-based News Summarization. Link: https://arxiv.org/abs/2110.07850

Authors: Yang Liu, Chenguang Zhu, Michael Zeng. Affiliations: Microsoft Cognitive Services Research Group. Abstract: In this paper, we bring a new way of digesting news content by introducing the task of segmenting a news article into multiple sections and generating the corresponding summary to each section. We make two contributions towards this new task. First, we create and make available a dataset, SegNews, consisting of 27k news articles with sections and aligned heading-style section summaries. Second, we propose a novel segmentation-based language generation model adapted from pre-trained language models that can jointly segment a document and produce the summary for each section. Experimental results on SegNews demonstrate that our model can outperform several state-of-the-art sequence-to-sequence generation models for this new task.

【3】 Cascaded Fast and Slow Models for Efficient Semantic Code Search. Link: https://arxiv.org/abs/2110.07811

Authors: Akhilesh Deepak Gotmare, Junnan Li, Shafiq Joty, Steven C. H. Hoi. Affiliations: Salesforce Research Asia. Note: 12 pages. Abstract: The goal of natural language semantic code search is to retrieve a semantically relevant code snippet from a fixed set of candidates using a natural language query. Existing approaches are neither effective nor efficient enough for a practical semantic code search system. In this paper, we propose an efficient and accurate semantic code search framework with cascaded fast and slow models, in which a fast transformer encoder model is learned to optimize a scalable index for fast retrieval, followed by learning a slow classification-based re-ranking model to improve the performance of the top K results from the fast retrieval. To further reduce the high memory cost of deploying two separate models in practice, we propose to jointly train the fast and slow model based on a single transformer encoder with shared parameters. The proposed cascaded approach is not only efficient and scalable, but also achieves state-of-the-art results with an average mean reciprocal ranking (MRR) score of 0.7795 (across 6 programming languages) as opposed to the previous state-of-the-art result of 0.713 MRR on the CodeSearchNet benchmark.
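
A sketch of the cascaded retrieve-then-rerank pattern: a fast bi-encoder scores the whole candidate pool via dot products over a precomputed index, and a slower joint scorer re-ranks only the top K. `fast_encoder` and `slow_scorer` are placeholders standing in for the shared-parameter transformer described in the abstract.

```python
import numpy as np

def search(query, code_snippets, code_index, fast_encoder, slow_scorer, k=50):
    """Two-stage semantic code search: scalable dense retrieval followed by re-ranking."""
    q = fast_encoder(query)              # (d,) dense query embedding
    scores = code_index @ q              # dot product against precomputed snippet embeddings
    top_k = np.argsort(-scores)[:k]      # cheap candidate shortlist
    reranked = sorted(
        top_k,
        key=lambda i: slow_scorer(query, code_snippets[i]),  # expensive joint (query, code) score
        reverse=True,
    )
    return [code_snippets[i] for i in reranked]

# code_index = np.stack([fast_encoder(c) for c in code_snippets])  # built offline once
```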

Graph | Knowledge Graph | Knowledge (2 papers)

【1】 Rewire-then-Probe: A Contrastive Recipe for Probing Biomedical Knowledge of Pre-trained Language Models. Link: https://arxiv.org/abs/2110.08173

Authors: Zaiqiao Meng, Fangyu Liu, Ehsan Shareghi, Yixuan Su, Charlotte Collins, Nigel Collier. Abstract: Knowledge probing is crucial for understanding the knowledge transfer mechanism behind the pre-trained language models (PLMs). Despite the growing progress of probing knowledge for PLMs in the general domain, specialised areas such as the biomedical domain are vastly under-explored. To catalyse the research in this direction, we release a well-curated biomedical knowledge probing benchmark, MedLAMA, which is constructed based on the Unified Medical Language System (UMLS) Metathesaurus. We test a wide spectrum of state-of-the-art PLMs and probing approaches on our benchmark, reaching at most 3% of acc@10. While highlighting various sources of domain-specific challenges that amount to this underwhelming performance, we illustrate that the underlying PLMs have a higher potential for probing tasks. To achieve this, we propose Contrastive-Probe, a novel self-supervised contrastive probing approach, that adjusts the underlying PLMs without using any probing data. While Contrastive-Probe pushes the acc@10 to 28%, the performance gap still remains notable. Our human expert evaluation suggests that the probing performance of our Contrastive-Probe is still under-estimated as UMLS still does not include the full spectrum of factual knowledge. We hope MedLAMA and Contrastive-Probe facilitate further developments of more suited probing techniques for this domain.

【2】 Multilingual Speech Recognition using Knowledge Transfer across Learning Processes. Link: https://arxiv.org/abs/2110.07909

Authors: Rimita Lahiri, Kenichi Kumatani, Eric Sun, Yao Qian. Affiliations: Signal Analysis and Interpretation Laboratory, University of Southern California, USA; Microsoft Corp., USA. Note: 5 pages. Abstract: Multilingual end-to-end (E2E) models have shown a great potential in the expansion of the language coverage in the realm of automatic speech recognition (ASR). In this paper, we aim to enhance the multilingual ASR performance in two ways: 1) studying the impact of feeding a one-hot vector identifying the language, 2) formulating the task with a meta-learning objective combined with self-supervised learning (SSL). We associate every language with a distinct task manifold and attempt to improve the performance by transferring knowledge across learning processes itself as compared to transferring through final model parameters. We employ this strategy on a dataset comprising of 6 languages for an in-domain ASR task, by minimizing an objective related to expected gradient path length. Experimental results reveal that the best pre-training strategy results in a 3.55% relative reduction in overall WER. A combination of LEAP and SSL yields a 3.51% relative reduction in overall WER when using language ID.

Summarization | Information Extraction (5 papers)

【1】 DYLE: Dynamic Latent Extraction for Abstractive Long-Input Summarization. Link: https://arxiv.org/abs/2110.08168

Authors: Ziming Mao, Chen Henry Wu, Ansong Ni, Yusen Zhang, Rui Zhang, Tao Yu, Budhaditya Deb, Chenguang Zhu, Ahmed H. Awadallah, Dragomir Radev. Affiliations: Yale University, Carnegie Mellon University, Penn State University, The University of Hong Kong, Microsoft Research. Abstract: Transformer-based models have achieved state-of-the-art performance on short text summarization. However, they still struggle with long-input summarization. In this paper, we present a new approach for long-input summarization: Dynamic Latent Extraction for Abstractive Summarization. We jointly train an extractor with an abstractor and treat the extracted text snippets as the latent variable. We propose extractive oracles to provide the extractor with a strong learning signal. We introduce consistency loss, which encourages the extractor to approximate the averaged dynamic weights predicted by the generator. We conduct extensive tests on two long-input summarization datasets, GovReport (document) and QMSum (dialogue). Our model significantly outperforms the current state-of-the-art, including a 6.21 ROUGE-2 improvement on GovReport and a 2.13 ROUGE-1 improvement on QMSum. Further analysis shows that the dynamic weights make our generation process highly interpretable. Our code will be publicly available upon publication.
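
A sketch of the consistency idea mentioned above: the extractor's distribution over snippets is pushed toward the generator's dynamic weights averaged over decoding steps. The KL formulation and variable names are assumptions based on the abstract, not the released implementation.

```python
import torch
import torch.nn.functional as F

def consistency_loss(extractor_logits, generator_weights):
    """
    extractor_logits:  (num_snippets,) scores from the extractor
    generator_weights: (num_decoding_steps, num_snippets) dynamic weights from the generator
    """
    target = generator_weights.mean(dim=0).detach()   # average dynamic weights over decoding steps
    log_p = F.log_softmax(extractor_logits, dim=-1)   # extractor distribution over snippets
    return F.kl_div(log_p, target, reduction="sum")   # encourage extractor to approximate the averaged weights
```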

【2】 Integrating diverse extraction pathways using iterative predictions for Multilingual Open Information Extraction. Link: https://arxiv.org/abs/2110.08144

Authors: Bhushan Kotnis, Kiril Gashteovski, Carolin Lawrence, Daniel Oñoro Rubio, Vanesa Rodriguez-Tembras, Makoto Takamoto, Mathias Niepert. Affiliations: NEC Laboratories Europe, Heidelberg, Germany; Heidelberg University, Center for Iberoamerican Studies, Germany. Abstract: In this paper we investigate a simple hypothesis for the Open Information Extraction (OpenIE) task: that it may be easier to extract some elements of a triple if the extraction is conditioned on prior extractions which may be easier to extract. We successfully exploit this and propose a neural multilingual OpenIE system, MiLIE, that iteratively extracts triples by conditioning extractions on different elements of the triple, leading to a rich set of extractions. The iterative nature of MiLIE also allows for seamlessly integrating rule-based extraction systems with a neural end-to-end system, leading to improved performance. MiLIE outperforms SOTA systems on multiple languages ranging from Chinese to Galician thanks to its ability to combine multiple extraction pathways. Our analysis confirms that it is indeed true that certain elements of an extraction are easier to extract than others. Finally, we introduce OpenIE evaluation datasets for two low-resource languages, namely Japanese and Galician.

【3】 Bridging the Gap: Cross-Lingual Summarization with Compression Rate. Link: https://arxiv.org/abs/2110.07936

Authors: Yu Bai, Heyan Huang, Kai Fan, Yang Gao, Zewen Chi, Boxing Chen. Affiliations: School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; Machine Intelligence Technology Lab, Alibaba DAMO Academy. Note: Work in progress. Abstract: Cross-lingual Summarization (CLS), converting a document into a cross-lingual summary, is highly related to the Machine Translation (MT) task. However, MT resources are still underutilized for the CLS task. In this paper, we propose a novel task, Cross-lingual Summarization with Compression rate (CSC), to benefit cross-lingual summarization through a large-scale MT corpus. By introducing the compression rate, we regard the MT task as a special CLS task with a compression rate of 100%. Hence the two can be trained as a unified task, sharing knowledge more effectively. Moreover, to bridge these two tasks smoothly, we propose a simple yet effective data augmentation method to produce document-summary pairs with different compression rates. The proposed method not only improves the performance of the CLS task, but also provides controllability to generate summaries of desired lengths. Experiments demonstrate that our method outperforms various strong baselines.
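
A sketch of how summarization and MT pairs might be unified under a single compression-rate-controlled format, with a control token prepended to the source and MT treated as the 100% case. The token format and 5% bucketing are assumptions for illustration.

```python
def make_example(source, target, task):
    """Tag each training pair with its compression rate; MT is treated as CLS with rate 100%."""
    rate = 100 if task == "mt" else round(100 * len(target.split()) / len(source.split()))
    bucket = min(100, max(5, 5 * round(rate / 5)))   # coarse 5% buckets for the control token
    return {"src": f"<cr_{bucket}> {source}", "tgt": target}

print(make_example("Ein langer Artikel ...", "A long article ...", task="mt"))
print(make_example("A very long source document " * 20, "Short cross-lingual summary.", task="cls"))
```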

【4】 Modeling Endorsement for Multi-Document Abstractive Summarization. Link: https://arxiv.org/abs/2110.07844

Authors: Logan Lebanoff, Bingqing Wang, Zhe Feng, Fei Liu. Affiliations: University of Central Florida, Orlando, FL; Robert Bosch LLC, Sunnyvale, CA. Note: EMNLP 2021 Workshop on New Frontiers in Summarization. Abstract: A crucial difference between single- and multi-document summarization is how salient content manifests itself in the document(s). While such content may appear at the beginning of a single document, essential information is frequently reiterated in a set of documents related to a particular topic, resulting in an endorsement effect that increases information salience. In this paper, we model the cross-document endorsement effect and its utilization in multiple document summarization. Our method generates a synopsis from each document, which serves as an endorser to identify salient content from other documents. Strongly endorsed text segments are used to enrich a neural encoder-decoder model to consolidate them into an abstractive summary. The method has a great potential to learn from fewer examples to identify salient content, which alleviates the need for costly retraining when the set of documents is dynamically adjusted. Through extensive experiments on benchmark multi-document summarization datasets, we demonstrate the effectiveness of our proposed method over strong published baselines. Finally, we shed light on future research directions and discuss broader challenges of this task using a case study.

【5】 Making Document-Level Information Extraction Right for the Right Reasons. Link: https://arxiv.org/abs/2110.07686

Authors: Liyan Tang, Dhruv Rajan, Suyash Mohan, Abhijeet Pradhan, R. Nick Bryan, Greg Durrett. Affiliations: The University of Texas at Austin, University of Pennsylvania, Galileo CDS Inc. Note: 9 pages (14 with references and appendix), 3 figures. Abstract: Document-level information extraction is a flexible framework compatible with applications where information is not necessarily localized in a single sentence. For example, key features of a diagnosis in a radiology report may not be explicitly stated, but nevertheless can be inferred from the report's text. However, document-level neural models can easily learn spurious correlations from irrelevant information. This work studies how to ensure that these models make correct inferences from complex text and make those inferences in an auditable way: beyond just being right, are these models "right for the right reasons?" We experiment with post-hoc evidence extraction in a predict-select-verify framework using feature attribution techniques. While this basic approach can extract reasonable evidence, it can be regularized with small amounts of evidence supervision during training, which substantially improves the quality of extracted evidence. We evaluate on two domains: a small-scale labeled dataset of brain MRI reports and a large-scale modified version of DocRED (Yao et al., 2019) and show that models' plausibility can be improved with no loss in accuracy.

Reasoning | Analysis | Understanding | Interpretation (1 paper)

【1】 Span Detection for Aspect-Based Sentiment Analysis in Vietnamese. Link: https://arxiv.org/abs/2110.07833

Authors: Kim Thi-Thanh Nguyen, Sieu Khai Huynh, Luong Luc Phan, Phuc Huynh Pham, Duc-Vu Nguyen, Kiet Van Nguyen. Affiliations: University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam. Abstract: Aspect-based sentiment analysis plays an essential role in natural language processing and artificial intelligence. Recently, researchers have focused only on aspect detection and sentiment classification, ignoring the sub-task of detecting user opinion spans, which has enormous potential in practical applications. In this paper, we present a new Vietnamese dataset (UIT-ViSD4SA) consisting of 35,396 human-annotated spans on 11,122 feedback comments for evaluating span detection in aspect-based sentiment analysis. Besides, we also propose a novel system using Bidirectional Long Short-Term Memory (BiLSTM) with a Conditional Random Field (CRF) layer (BiLSTM-CRF) for the span detection task in Vietnamese aspect-based sentiment analysis. The best result is a 62.76% F1 score (macro) for span detection using BiLSTM-CRF with embedding fusion of syllable embedding, character embedding, and contextual embedding from XLM-RoBERTa. In future work, span detection will be extended to many NLP tasks such as constructive detection, emotion recognition, complaint analysis, and opinion mining. Our dataset is freely available at https://github.com/kimkim00/UIT-ViSD4SA for research purposes.
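
A compact BiLSTM-CRF tagger sketch in PyTorch (using the pytorch-crf package) for BIO-style span detection. The fused syllable/character/contextual embeddings from the paper are collapsed into a single embedding layer here, so this is a simplified stand-in rather than the authors' exact model.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=300, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)   # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def emissions(self, tokens):
        return self.proj(self.lstm(self.emb(tokens))[0])

    def loss(self, tokens, tags, mask):
        # CRF returns the log-likelihood; negate it to get a training loss.
        return -self.crf(self.emissions(tokens), tags, mask=mask, reduction="mean")

    def decode(self, tokens, mask):
        return self.crf.decode(self.emissions(tokens), mask=mask)  # best BIO tag sequence per sentence
```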

GAN | Adversarial | Attacks | Generation (4 papers)

【1】 Guiding Visual Question Generation. Link: https://arxiv.org/abs/2110.08226

Authors: Nihir Vedd, Zixu Wang, Marek Rei, Yishu Miao, Lucia Specia. Affiliations: Department of Computing, Imperial College London. Note: 11 pages including references and appendix; 3 figures and 3 tables. Abstract: In traditional Visual Question Generation (VQG), most images have multiple concepts (e.g. objects and categories) for which a question could be generated, but models are trained to mimic an arbitrary choice of concept as given in their training data. This makes training difficult and also poses issues for evaluation -- multiple valid questions exist for most images but only one or a few are captured by the human references. We present Guiding Visual Question Generation - a variant of VQG which conditions the question generator on categorical information based on expectations on the type of question and the objects it should explore. We propose two variants: (i) an explicitly guided model that enables an actor (human or automated) to select which objects and categories to generate a question for; and (ii) an implicitly guided model that learns which objects and categories to condition on, based on discrete latent variables. The proposed models are evaluated on an answer-category augmented VQA dataset and our quantitative results show a substantial improvement over the current state of the art (an increase of over 9 BLEU-4). Human evaluation validates that guidance helps the generation of questions that are grammatically coherent and relevant to the given image and objects.

【2】 Jurassic is (almost) All You Need: Few-Shot Meaning-to-Text Generation for Open-Domain Dialogue. Link: https://arxiv.org/abs/2110.08094

Authors: Lena Reed, Cecilia Li, Angela Ramirez, Liren Wu, Marilyn Walker. Affiliations: University of California Santa Cruz. Note: The 12th International Workshop on Spoken Dialog System Technology, IWSDS 2021. Abstract: One challenge with open-domain dialogue systems is the need to produce high-quality responses on any topic. We aim to improve the quality and coverage of Athena, an Alexa Prize dialogue system. We utilize Athena's response generators (RGs) to create training data for two new neural Meaning-to-Text RGs, Athena-GPT-Neo and Athena-Jurassic, for the movies, music, TV, sports, and video game domains. We conduct few-shot experiments, both within and cross-domain, with different tuning set sizes (2, 3, 10), prompt formats, and meaning representations (MRs) for sets of WikiData KG triples, and dialogue acts with 14 possible attribute combinations. Our evaluation uses BLEURT and human evaluation metrics, and shows that with 10-shot tuning, Athena-Jurassic's performance is significantly better for coherence and semantic accuracy. Experiments with 2-shot tuning on completely novel MRs results in a huge performance drop for Athena-GPT-Neo, whose semantic accuracy falls to 0.41, and whose untrue hallucination rate increases to 12%. Experiments with dialogue acts for video games show that with 10-shot tuning, both models learn to control dialogue acts, but Athena-Jurassic has significantly higher coherence, and only 4% untrue hallucinations. Our results suggest that Athena-Jurassic can reliably produce outputs of high-quality for live systems with real users. To our knowledge, these are the first results demonstrating that few-shot tuning on a massive language model can create NLGs that generalize to new domains, and produce high-quality, semantically-controlled, conversational responses directly from MRs and KG triples.

【3】 Generating Natural Language Adversarial Examples through An Improved Beam Search Algorithm. Link: https://arxiv.org/abs/2110.08036

Authors: Tengfei Zhao, Zhaocheng Ge, Hanping Hu, Dingmeng Shi. Affiliations: School of Artificial Intelligence and Automation, Huazhong University of Science and Technology; Key Laboratory of Image Information Processing and Intelligent Control, Ministry of Education. Note: 9 pages, 4 figures. Abstract: Research on adversarial attacks in the text domain has attracted much interest in the last few years, and many methods with a high attack success rate have been proposed. However, these attack methods are inefficient, as they require large numbers of queries to the victim model when crafting text adversarial examples. In this paper, a novel attack model is proposed whose attack success rate surpasses the benchmark attack methods, but, more importantly, whose attack efficiency is much higher than the benchmark attack methods. The novel method is empirically evaluated by attacking WordCNN, LSTM, BiLSTM, and BERT on four benchmark datasets. For instance, it achieves a 100% attack success rate, higher than the state-of-the-art method, when attacking BERT and BiLSTM on IMDB, while the number of queries to the victim models is only 1/4 and 1/6.5 that of the state-of-the-art method, respectively. Further experiments show that the novel method also yields good transferability of the generated adversarial examples.
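
A heavily simplified sketch of a word-substitution attack with beam search: the beam keeps the B perturbed texts that most reduce the victim's confidence in the true label. The candidate-generation step (`synonyms`), the scoring, and the stopping criterion are placeholders; the paper's improved, query-efficient beam search is not reproduced here.

```python
def beam_attack(words, true_label, victim_prob, synonyms, beam_size=3, max_edits=5):
    """victim_prob(words, label) -> probability the victim model assigns to `label`."""
    beam = [(victim_prob(words, true_label), words)]
    for _ in range(max_edits):
        candidates = []
        for _, current in beam:
            for i, w in enumerate(current):
                for sub in synonyms(w):
                    perturbed = current[:i] + [sub] + current[i + 1:]
                    candidates.append((victim_prob(perturbed, true_label), perturbed))
        if not candidates:
            break
        beam = sorted(candidates, key=lambda x: x[0])[:beam_size]  # keep lowest true-label confidence
        if beam[0][0] < 0.5:      # the predicted label has likely flipped
            return beam[0][1]
    return None  # attack failed within the edit budget
```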

【4】 Hindsight: Posterior-guided training of retrievers for improved open-ended generation. Link: https://arxiv.org/abs/2110.07752

Authors: Ashwin Paranjape, Omar Khattab, Christopher Potts, Matei Zaharia, Christopher D. Manning. Affiliations: Stanford University. Abstract: Many text generation systems benefit from using a retriever to retrieve passages from a textual knowledge corpus (e.g., Wikipedia) which are then provided as additional context to the generator. For open-ended generation tasks (like generating informative utterances in conversations) many varied passages may be equally relevant, and we find that existing methods that jointly train the retriever and generator underperform: the retriever may not find relevant passages even amongst the top-10 and hence the generator may not learn a preference to ground its generated output in them. We propose using an additional guide retriever that is allowed to use the target output and "in hindsight" retrieve relevant passages during training. We model the guide retriever after the posterior distribution Q of passages given the input and the target output and train it jointly with the standard retriever and the generator by maximizing the evidence lower bound (ELBo) in expectation over Q. For informative conversations from the Wizard of Wikipedia dataset, with posterior-guided training, the retriever finds passages with higher relevance in the top-10 (23% relative improvement), the generator's responses are more grounded in the retrieved passage (19% relative improvement) and the end-to-end system produces better overall output (6.4% relative improvement).
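
A sketch of the posterior-guided objective described above: passage scores from the guide retriever Q (which sees the target) define a distribution used to weight the generator's likelihood, with a KL term keeping the standard retriever close to Q. The variable names, the single-batch formulation, and the exact weighting are assumptions based on the abstract.

```python
import torch
import torch.nn.functional as F

def elbo_loss(retriever_scores, guide_scores, gen_logprobs):
    """
    retriever_scores: (K,) passage scores from the standard retriever (input only)
    guide_scores:     (K,) passage scores from the guide retriever (input + target, "hindsight")
    gen_logprobs:     (K,) log P(target | input, passage_k) from the generator
    """
    q = F.softmax(guide_scores, dim=-1)                            # posterior-like distribution Q over passages
    expected_ll = (q * gen_logprobs).sum()                         # E_Q[log P_gen(target | input, passage)]
    kl = F.kl_div(F.log_softmax(retriever_scores, -1), q, reduction="sum")
    return -expected_ll + kl                                       # maximizing the ELBo = minimizing this loss
```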

半/弱/无监督|不确定性(1篇)

【1】 Don't speak too fast: The impact of data bias on self-supervised speech models 标题:不要说得太快:数据偏差对自我监督语音模型的影响 链接:https://arxiv.org/abs/2110.07957

作者:Yen Meng,Yi-Hui Chou,Andy T. Liu,Hung-yi Lee 机构:⋆Graduate Institute of Communication Engineering, National Taiwan University, †College of Electrical Engineering and Computer Science, National Taiwan University 备注:Submitted to ICASSP 2022 摘要:自监督语音模型(S3M)已被证明在许多语音下游任务中是成功的,如ASR。然而,训练前数据如何影响S3M的下游行为仍然是一个未探索的问题。在本文中,我们通过针对不同语音因素(包括性别、内容和韵律)的有偏数据集上的预训练模型,研究预训练数据对S3M的影响,并在SUPERB Benchmark中对这些预训练S3M进行下游任务评估。我们的实验表明,S3M对性别偏见具有耐受性。此外,我们发现语音内容对S3Ms在下游任务中的性能几乎没有影响,但S3Ms确实倾向于使用较慢的语音速率。 摘要:Self-supervised Speech Models (S3Ms) have been proven successful in many speech downstream tasks, like ASR. However, how pre-training data affects S3Ms' downstream behavior remains an unexplored issue. In this paper, we study how pre-training data affects S3Ms by pre-training models on biased datasets targeting different factors of speech, including gender, content, and prosody, and evaluate these pre-trained S3Ms on selected downstream tasks in SUPERB Benchmark. Our experiments show that S3Ms have tolerance toward gender bias. Moreover, we find that the content of speech has little impact on the performance of S3Ms across downstream tasks, but S3Ms do show a preference toward a slower speech rate.

检测相关(1篇)

【1】 Is Stance Detection Topic-Independent and Cross-topic Generalizable? -- A Reproduction Study 标题:立场检测是否独立于主题并可跨主题泛化?——一项再现性研究 链接:https://arxiv.org/abs/2110.07693

作者:Myrthe Reuver,Suzan Verberne,Roser Morante,Antske Fokkens 机构:Computational Linguistics and Text Mining Lab, Vrije Universiteit Amsterdam, Dept. of Mathematics and Computer Science, Eindhoven University of Technology, Leiden Institute of Advanced Computer Science, Leiden University 备注:Accepted at the 8th Workshop on Argument Mining, 2021 co-located with EMNLP 2021. Cite the published version 摘要:跨主题立场检测的任务是自动检测未见过主题上的立场(赞成、反对或中立)。我们成功地再现了最先进的跨主题立场检测工作(Reimers等人,2019年),并系统地分析了其再现性。随后,我们将注意力转向这项工作的跨主题方面,以及主题在词汇和社会文化语境方面的特殊性。我们的问题是:立场检测在多大程度上独立于主题并且可以跨主题泛化?我们比较了模型在各种未见过主题上的性能,发现主题(如堕胎、克隆)、类别(如赞成、反对)以及二者之间的交互作用都会影响模型的性能。我们得出结论:考察模型在不同主题上的表现,并处理特定主题的词汇和语境,是跨主题立场检测未来的研究方向。 摘要:Cross-topic stance detection is the task to automatically detect stances (pro, against, or neutral) on unseen topics. We successfully reproduce state-of-the-art cross-topic stance detection work (Reimers et. al., 2019), and systematically analyze its reproducibility. Our attention then turns to the cross-topic aspect of this work, and the specificity of topics in terms of vocabulary and socio-cultural context. We ask: To what extent is stance detection topic-independent and generalizable across topics? We compare the model's performance on various unseen topics, and find topic (e.g. abortion, cloning), class (e.g. pro, con), and their interaction affecting the model's performance. We conclude that investigating performance on different topics, and addressing topic-specific vocabulary and context, is a future avenue for cross-topic stance detection.

识别/分类(1篇)

【1】 A Multilingual Bag-of-Entities Model for Zero-Shot Cross-Lingual Text Classification 标题:一种面向零样本跨语言文本分类的多语种实体袋模型 链接:https://arxiv.org/abs/2110.07792

作者:Sosuke Nishikawa,Ikuya Yamada,Yoshimasa Tsuruoka,Isao Echizen 机构:The University of Tokyo ,Studio Ousia, National Institute of Informatics ,RIKEN AIP 备注:11 pages, 3 figures 摘要:我们提出了一种多语言实体袋模型,该模型通过扩展多语言预训练语言模型(如M-BERT),有效地提高了Zero-Shot跨语言文本分类的性能。它利用了Wikidata的多语言特性:表示同一概念的多语言实体使用唯一标识符定义。这使得以多种语言描述的实体能够使用共享嵌入来表示。因此,基于资源丰富的语言中的实体特征训练的模型可以直接应用于其他语言。我们在跨语言主题分类(使用MLDoc和TED-CLDC数据集)和实体类型(使用SHINRA2020-ML数据集)方面的实验结果表明,所提出的模型始终优于最先进的模型。 摘要:We present a multilingual bag-of-entities model that effectively boosts the performance of zero-shot cross-lingual text classification by extending a multilingual pre-trained language model (e.g., M-BERT). It leverages the multilingual nature of Wikidata: entities in multiple languages representing the same concept are defined with a unique identifier. This enables entities described in multiple languages to be represented using shared embeddings. A model trained on entity features in a resource-rich language can thus be directly applied to other languages. Our experimental results on cross-lingual topic classification (using the MLDoc and TED-CLDC datasets) and entity typing (using the SHINRA2020-ML dataset) show that the proposed model consistently outperforms state-of-the-art models.

Zero/Few/One-Shot|迁移|自适应(2篇)

【1】 Multitask Prompted Training Enables Zero-Shot Task Generalization 标题:多任务提示式训练实现零样本任务泛化 链接:https://arxiv.org/abs/2110.08207

作者:Victor Sanh,Albert Webson,Colin Raffel,Stephen H. Bach,Lintang Sutawika,Zaid Alyafeai,Antoine Chaffin,Arnaud Stiegler,Teven Le Scao,Arun Raja,Manan Dey,M Saiful Bari,Canwen Xu,Urmish Thakker,Shanya Sharma Sharma,Eliza Szczechla,Taewoon Kim,Gunjan Chhablani,Nihal Nayak,Debajyoti Datta,Jonathan Chang,Mike Tian-Jian Jiang,Han Wang,Matteo Manica,Sheng Shen,Zheng Xin Yong,Harshit Pandey,Rachel Bawden,Thomas Wang,Trishala Neeraj,Jos Rozen,Abheesht Sharma,Andrea Santilli,Thibault Fevry,Jason Alan Fries,Ryan Teehan,Stella Biderman,Leo Gao,Tali Bers,Thomas Wolf,Alexander M. Rush 机构:Hugging Face, Brown University, BigScience, KFUPM, IRISA, IMATAG, Hyperscience, I,R, ASTAR, SAP, NTU, UCSDHF, SambaNova Systems, Shanya Sharma, Walmart Labs, VU Amsterdam, Nihal V. Nayak, University of Virginia, ASUS, ZEALS, NYU, IBM Research, UC Berkeley, Zheng-Xin Yong, Michael McKenna, Parity 备注:this https URL 摘要:大型语言模型最近被证明可以在多样化的任务集上实现合理的零样本泛化。有人假设这是语言模型训练中内隐多任务学习的结果。零样本泛化能否直接由显式多任务学习产生?为了大规模地检验这个问题,我们开发了一个系统,可以轻松地将一般自然语言任务映射为人类可读的提示形式。我们使用不同的自然语言转换一大批有监督的数据集,每个数据集都有多个提示。这些提示化的数据集可以用来基准测试模型执行完全未见过的、以自然语言指定的任务的能力。我们在这个涵盖各种任务的多任务混合数据上微调了一个预训练的编码器-解码器模型。该模型在多个标准数据集上实现了强大的零样本性能,通常优于规模为其16倍的模型。此外,我们的方法在BIG-Bench基准的一部分任务上也取得了很好的性能,超过了规模为其6倍的模型。所有提示和训练好的模型均可在github.com/bigscience-workshop/promptsource/获得。 摘要:Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks. It has been hypothesized that this is a consequence of implicit multitask learning in language model training. Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping general natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts using varying natural language. These prompted datasets allow for benchmarking the ability of a model to perform completely unseen tasks specified in natural language. We fine-tune a pretrained encoder-decoder model on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-Bench benchmark, outperforming models 6x its size. All prompts and trained models are available at github.com/bigscience-workshop/promptsource/.

【2】 SPoT: Better Frozen Model Adaptation through Soft Prompt Transfer 标题:SPoT:通过软提示迁移实现更好的冻结模型适配 链接:https://arxiv.org/abs/2110.07904

作者:Tu Vu,Brian Lester,Noah Constant,Rami Al-Rfou,Daniel Cer 机构:Google Research, University of Massachusetts Amherst 备注:20 pages, 6 figures, 5 tables 摘要:随着预训练语言模型变得越来越大,将这些模型应用于下游任务的参数高效方法越来越受到关注。基于Lester et al.(2021)的PromptTuning方法(该方法学习特定于任务的软提示,以调节冻结的语言模型执行下游任务),我们提出了一种新的基于提示的迁移学习方法,称为SPoT:软提示迁移(Soft Prompt Transfer)。SPoT首先在一个或多个源任务上学习提示,然后用它初始化目标任务的提示。我们发现SPoT显著提高了许多任务上PromptTuning的性能。更重要的是,在所有模型规模上,SPoT都能匹配或优于针对每个任务微调整个模型的ModelTuning,同时参数效率更高(任务特定参数最多减少27000倍)。我们进一步在26个NLP任务和160个源-目标任务组合上对任务可迁移性进行了大规模研究,并证明任务之间通常可以通过提示迁移相互受益。最后,我们提出了一种简单而高效的检索方法,将任务提示解释为任务嵌入,以度量任务之间的相似性,并为给定的新目标任务预测最可迁移的源任务。 摘要:As pre-trained language models have gotten larger, there has been growing interest in parameter-efficient methods to apply these models to downstream tasks. Building on the PromptTuning approach of Lester et al. (2021), which learns task-specific soft prompts to condition a frozen language model to perform downstream tasks, we propose a novel prompt-based transfer learning approach called SPoT: Soft Prompt Transfer. SPoT first learns a prompt on one or more source tasks and then uses it to initialize the prompt for a target task. We show that SPoT significantly boosts the performance of PromptTuning across many tasks. More importantly, SPoT either matches or outperforms ModelTuning, which fine-tunes the entire model on each individual task, across all model sizes while being more parameter-efficient (up to 27,000x fewer task-specific parameters). We further conduct a large-scale study on task transferability with 26 NLP tasks and 160 combinations of source-target tasks, and demonstrate that tasks can often benefit each other via prompt transfer. Finally, we propose a simple yet efficient retrieval approach that interprets task prompts as task embeddings to identify the similarity between tasks and predict the most transferable source tasks for a given novel target task.
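下面是一个极简的示意性草图,展示"软提示作为可训练参数、并用源任务学到的提示初始化目标任务提示"这一核心思路;提示长度、维度等均为假设,并非论文的官方实现。

```python
# 示意性代码:软提示迁移(SPoT)的核心思路草图
import torch
import torch.nn as nn

PROMPT_LEN, HIDDEN = 20, 768  # 假设的软提示长度与隐层维度

class SoftPrompt(nn.Module):
    def __init__(self, prompt_len=PROMPT_LEN, hidden=HIDDEN):
        super().__init__()
        # 冻结语言模型本身,仅训练这组连续的提示向量
        self.embeddings = nn.Parameter(torch.randn(prompt_len, hidden) * 0.02)

    def forward(self, input_embeds):
        # 将软提示拼接到输入词向量之前: [B, L, H] -> [B, P+L, H]
        batch = input_embeds.size(0)
        prompt = self.embeddings.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# 第一步:在一个或多个源任务上训练 source_prompt(训练循环省略)
source_prompt = SoftPrompt()
# 第二步:用源任务学到的提示初始化目标任务的提示,再在目标任务上继续微调
target_prompt = SoftPrompt()
target_prompt.embeddings.data.copy_(source_prompt.embeddings.data)
```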

表征(3篇)

【1】 mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models 标题:mLUKE:多语言预训练语言模型中实体表示的能力 链接:https://arxiv.org/abs/2110.08151

作者:Ryokan Ri,Ikuya Yamada,Yoshimasa Tsuruoka 机构:Studio Ousia, Tokyo, Japan, The University of Tokyo, Tokyo, Japan 摘要:最近的研究表明,使用维基百科实体的跨语言对齐信息可以有效地改进多语言预训练语言模型。然而,现有的方法仅在预训练中利用实体信息,在下游任务中不显式使用实体。在这项研究中,我们探讨了在下游跨语言任务中利用实体表示的有效性。我们训练了一个包含24种实体表示语言的多语言模型,结果表明该模型在各种跨语言迁移任务中始终优于基于单词的预训练模型。我们还分析了模型,关键的洞察是,将实体表示合并到输入中允许我们提取更多与语言无关的特征。我们还使用mLAMA数据集的多语言完形填空提示任务来评估模型。我们表明,基于实体的提示比仅使用单词表示更可能引出正确的事实知识。 摘要:Recent studies have shown that multilingual pretrained language models can be effectively improved with cross-lingual alignment information from Wikipedia entities. However, existing methods only exploit entity information in pretraining and do not explicitly use entities in downstream tasks. In this study, we explore the effectiveness of leveraging entity representations for downstream cross-lingual tasks. We train a multilingual language model with 24 languages with entity representations and show the model consistently outperforms word-based pretrained models in various cross-lingual transfer tasks. We also analyze the model and the key insight is that incorporating entity representations into the input allows us to extract more language-agnostic features. We also evaluate the model with a multilingual cloze prompt task with the mLAMA dataset. We show that entity-based prompt elicits correct factual knowledge more likely than using only word representations.

【2】 Modeling Proficiency with Implicit User Representations 标题:使用隐式用户表示进行熟练度建模 链接:https://arxiv.org/abs/2110.08011

作者:Kim Breitwieser,Allison Lahnala,Charles Welch,Lucie Flek,Martin Potthast 机构:†TU Darmstadt, ‡University of Marburg, §Leipzig University 摘要:我们介绍熟练程度建模问题:给定用户在社交媒体平台上的帖子,任务是确定用户具有一定熟练程度的帖子或主题子集。这可以根据用户熟练程度对给定主题的社交媒体帖子进行筛选和排名。与某一特定主题的专家不同,熟练的用户可能没有接受过正式训练并拥有多年的实践经验,但可能是自学成才、业余爱好者和持续感兴趣的人,使他们能够对话语做出真正的原创贡献。虽然预测用户是否是某一特定主题的专家会对谁是真正的积极参与者施加很大的限制,但熟练程度建模意味着评分,从而放松了这些限制。换句话说,许多活跃的社交媒体用户可以被认为拥有或最终获得与社区相关主题的某种程度的熟练程度。我们以无监督的方式处理熟练程度建模,方法是利用用户嵌入对给定主题的参与进行建模,如用户对创作相关内容的偏好所示。我们调查了五种模型熟练程度的替代方法,从基本方法到高级、定制的用户建模方法,应用于两个真实世界的评估基准。 摘要:We introduce the problem of proficiency modeling: Given a user's posts on a social media platform, the task is to identify the subset of posts or topics for which the user has some level of proficiency. This enables the filtering and ranking of social media posts on a given topic as per user proficiency. Unlike experts on a given topic, proficient users may not have received formal training and possess years of practical experience, but may be autodidacts, hobbyists, and people with sustained interest, enabling them to make genuine and original contributions to discourse. While predicting whether a user is an expert on a given topic imposes strong constraints on who is a true positive, proficiency modeling implies a graded scoring, relaxing these constraints. Put another way, many active social media users can be assumed to possess, or eventually acquire, some level of proficiency on topics relevant to their community. We tackle proficiency modeling in an unsupervised manner by utilizing user embeddings to model engagement with a given topic, as indicated by a user's preference for authoring related content. We investigate five alternative approaches to model proficiency, ranging from basic ones to an advanced, tailored user modeling approach, applied within two real-world benchmarks for evaluation.

【3】 Socially Aware Bias Measurements for Hindi Language Representations 标题:印地语表征的社会意识偏差测量 链接:https://arxiv.org/abs/2110.07871

作者:Vijit Malik,Sunipa Dev,Akihiro Nishi,Nanyun Peng,Kai-Wei Chang 备注:11 Pages (5 Pages main content 1 pages for references 5 Pages Appendix) 摘要:语言表征是NLP中使用的有效工具,但它们与编码的社会偏见相冲突。这些偏见已被广泛研究,但主要集中在西方社会中常见的英语语言表达和偏见上。在这项工作中,我们调查了印地语表达中存在的偏见,如种姓和宗教相关偏见。我们展示了偏见是如何根据其广泛使用的地区的历史和文化独特于特定语言表达的,以及在跨语言调查时相同的社会偏见(如二元性别相关偏见)是如何由不同的单词和文本跨度编码的。通过这项工作,我们强调在建模语言表征时,社会意识以及语言和语法人工制品的必要性,以便理解编码的偏见。 摘要:Language representations are an efficient tool used across NLP, but they are strife with encoded societal biases. These biases are studied extensively, but with a primary focus on English language representations and biases common in the context of Western society. In this work, we investigate the biases present in Hindi language representations such as caste and religion associated biases. We demonstrate how biases are unique to specific language representations based on the history and culture of the region they are widely spoken in, and also how the same societal bias (such as binary gender associated biases) when investigated across languages is encoded by different words and text spans. With this work, we emphasize on the necessity of social-awareness along with linguistic and grammatical artefacts when modeling language representations, in order to understand the biases encoded.

Word2Vec|文本|单词(4篇)

【1】 Textual Backdoor Attacks Can Be More Harmful via Two Simple Tricks 标题:通过两个简单的技巧,文本后门攻击的危害可能更大 链接:https://arxiv.org/abs/2110.08247

作者:Yangyi Chen,Fanchao Qi,Zhiyuan Liu,Maosong Sun 机构:Department of Computer Science and Technology, Tsinghua University, Beijing, China, Beijing National Research Center for Information Science and Technology, Institute for Artificial Intelligence, Tsinghua University, Beijing, China 备注:Work in progress 摘要:后门攻击是深度学习中的一种紧急安全威胁。当一个深度神经模型被注入后门时,它将在标准输入上正常运行,但一旦输入包含特定的后门触发器,它将给出对手指定的预测。当前的文本后门攻击在某些困难情况下具有较差的攻击性能。在本文中,我们发现两个简单的技巧可以使现有的文本后门攻击更加有害。第一个技巧是在受害者模型的训练过程中增加一个额外的训练任务来区分中毒数据和干净数据,第二个技巧是使用所有干净的训练数据,而不是删除中毒数据对应的原始干净数据。这两个技巧普遍适用于不同的攻击模型。我们在三种困难的情况下进行实验,包括干净的数据微调、低中毒率和标签一致性攻击。实验结果表明,这两种技巧可以显著提高攻击性能。本文展示了后门攻击的巨大潜在危害。所有代码和数据将公开,以便于进一步研究。 摘要:Backdoor attacks are a kind of emergent security threat in deep learning. When a deep neural model is injected with a backdoor, it will behave normally on standard inputs but give adversary-specified predictions once the input contains specific backdoor triggers. Current textual backdoor attacks have poor attack performance in some tough situations. In this paper, we find two simple tricks that can make existing textual backdoor attacks much more harmful. The first trick is to add an extra training task to distinguish poisoned and clean data during the training of the victim model, and the second one is to use all the clean training data rather than remove the original clean data corresponding to the poisoned data. These two tricks are universally applicable to different attack models. We conduct experiments in three tough situations including clean data fine-tuning, low poisoning rate, and label-consistent attacks. Experimental results show that the two tricks can significantly improve attack performance. This paper exhibits the great potential harmfulness of backdoor attacks. All the code and data will be made public to facilitate further research.
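下面的草图示意第一个技巧(在受害者模型训练中加入"区分中毒样本与干净样本"的辅助任务)可能的损失构成;模型结构与超参数均为假设,仅作概念说明,并非论文的官方实现。

```python
# 示意性代码:主分类任务 + 中毒/干净判别辅助任务的联合训练损失(概念草图)
import torch
import torch.nn as nn

class VictimModelWithAux(nn.Module):
    def __init__(self, encoder, hidden=768, num_labels=2):
        super().__init__()
        self.encoder = encoder                        # 任意文本编码器,输出 [B, H]
        self.cls_head = nn.Linear(hidden, num_labels) # 主任务:正常分类
        self.poison_head = nn.Linear(hidden, 2)       # 辅助任务:判断样本是否被投毒

    def forward(self, inputs):
        h = self.encoder(inputs)
        return self.cls_head(h), self.poison_head(h)

def training_loss(cls_logits, poison_logits, labels, poison_flags, lam=1.0):
    """labels 为任务标签,poison_flags 标记样本是否为中毒样本;lam 为假设的权重超参。"""
    ce = nn.CrossEntropyLoss()
    return ce(cls_logits, labels) + lam * ce(poison_logits, poison_flags)
```

第二个技巧(保留全部干净训练数据,而不是删除与中毒样本对应的原始干净样本)只涉及数据构造,不需要改动上面的损失。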

【2】 Cross-Domain Data Integration for Named Entity Disambiguation in Biomedical Text 标题:跨域数据集成在生物医学文本命名实体消歧中的应用 链接:https://arxiv.org/abs/2110.08228

作者:Maya Varma,Laurel Orr,Sen Wu,Megan Leszczynski,Xiao Ling,Christopher Ré 机构:Stanford University, Apple 备注:Accepted to Findings of EMNLP 2021 摘要:命名实体消歧(NED)涉及将文本提及映射到结构化实体,由于罕见实体的存在,在医学领域尤其具有挑战性。现有方法受到生物医学知识库中粗粒度结构资源的限制,以及对不常见资源覆盖率较低的训练数据集的使用。在这项工作中,我们通过提出一种跨领域数据集成方法来解决这些问题,该方法将结构知识从通用文本知识库转移到医学领域。我们利用我们的集成方案来增加结构资源,并生成一个用于预训练的大型生物医学NED数据集。我们的预训练模型结合注入的结构知识,在两个基准医学NED数据集(MedNeds和BC5CDR)上实现了最先进的性能。此外,我们将罕见实体的消歧提高了57个精度点。 摘要:Named entity disambiguation (NED), which involves mapping textual mentions to structured entities, is particularly challenging in the medical domain due to the presence of rare entities. Existing approaches are limited by the presence of coarse-grained structural resources in biomedical knowledge bases as well as the use of training datasets that provide low coverage over uncommon resources. In this work, we address these issues by proposing a cross-domain data integration method that transfers structural knowledge from a general text knowledge base to the medical domain. We utilize our integration scheme to augment structural resources and generate a large biomedical NED dataset for pretraining. Our pretrained model with injected structural knowledge achieves state-of-the-art performance on two benchmark medical NED datasets: MedMentions and BC5CDR. Furthermore, we improve disambiguation of rare entities by up to 57 accuracy points.

【3】 Scribosermo: Fast Speech-to-Text models for German and other Languages 标题:Scribosermo:德语和其他语言的快速语音到文本模型 链接:https://arxiv.org/abs/2110.07982

作者:Daniel Bermuth,Alexander Poeppel,Wolfgang Reif 机构:University of Augsburg, Institute for Software & Systems Engineering 摘要:最近的语音到文本模型通常需要大量的硬件资源,并且大多是用英语训练的。本文介绍了德语、西班牙语和法语的语音到文本模型,具有以下特点:(a)它们体积小,在微控制器(如树莓)上实时运行。(b) 使用预先训练的英语模型,他们可以在相对较小的数据集的消费级硬件上接受训练。(c) 该模型与其他解决方案相比具有竞争力,在德国的表现优于其他解决方案。在这方面,这些模型结合了其他方法的优点,这些方法只包括所呈现特征的子集。此外,本文还提供了一个用于处理数据集的新库,该库侧重于使用附加数据集进行简单扩展,并展示了一种优化方法,用于使用预训练模型从具有类似字母表的另一种语言迁移学习新语言。 摘要:Recent Speech-to-Text models often require a large amount of hardware resources and are mostly trained in English. This paper presents Speech-to-Text models for German, as well as for Spanish and French with special features: (a) They are small and run in real-time on microcontrollers like a RaspberryPi. (b) Using a pretrained English model, they can be trained on consumer-grade hardware with a relatively small dataset. (c) The models are competitive with other solutions and outperform them in German. In this respect, the models combine advantages of other approaches, which only include a subset of the presented features. Furthermore, the paper provides a new library for handling datasets, which is focused on easy extension with additional datasets and shows an optimized way for transfer-learning new languages using a pretrained model from another language with a similar alphabet.

【4】 Large Scale Substitution-based Word Sense Induction 标题:基于替换的大规模词义归纳 链接:https://arxiv.org/abs/2110.07681

作者:Matan Eyal,Shoval Sadde,Hillel Taub-Tabib,Yoav Goldberg 机构:Allen Institute for AI, Tel Aviv, Israel, †Bar Ilan University, Ramat-Gan, Israel 摘要:我们提出了一种基于预训练掩码语言模型(MLM)的词义归纳方法,该方法可以低成本地扩展到大型词表和大型语料库。其结果是一个根据由语料库导出的义项清单进行义项标注的语料库,其中每个义项都与指示词相关联。对使用我们的方法进行义项标注的英文维基百科的评估表明,即使与Babelfy等WSD方法相比,所归纳的义项和每个实例的义项指派都具有很高的质量。此外,通过在义项标注语料库上训练静态词嵌入算法,我们获得了高质量的、区分义项的静态嵌入。这些嵌入在WiC数据集和我们开发的一个新的离群点检测数据集上都优于现有的义项感知(senseful)嵌入技术。该算法的数据驱动特性允许归纳特定于语料库的义项,而这些义项可能不会出现在标准义项清单中,正如我们在科学领域的案例研究中所展示的那样。 摘要:We present a word-sense induction method based on pre-trained masked language models (MLMs), which can cheaply scale to large vocabularies and large corpora. The result is a corpus which is sense-tagged according to a corpus-derived sense inventory and where each sense is associated with indicative words. Evaluation on English Wikipedia that was sense-tagged using our method shows that both the induced senses, and the per-instance sense assignment, are of high quality even compared to WSD methods, such as Babelfy. Furthermore, by training a static word embeddings algorithm on the sense-tagged corpus, we obtain high-quality static senseful embeddings. These outperform existing senseful embeddings techniques on the WiC dataset and on a new outlier detection dataset we developed. The data driven nature of the algorithm allows to induce corpora-specific senses, which may not appear in standard sense inventories, as we demonstrate using a case study on the scientific domain.
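下面给出一个示意性片段,演示"用掩码语言模型为目标词生成替换词"这一步骤;论文中后续的聚类/归纳步骤在此被简化为替换词重叠度的对比,所用模型与例句均为假设,仅作说明。

```python
# 示意性代码:用MLM为同一个词的不同出现位置生成替换词,作为义项归纳的输入信号
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

sentences = [
    "The bass swam near the bottom of the lake.",   # 鱼
    "He plays bass in a jazz band.",                # 乐器
]

substitutes = []
for sent in sentences:
    masked = sent.replace("bass", unmasker.tokenizer.mask_token, 1)
    preds = unmasker(masked, top_k=10)
    substitutes.append([p["token_str"].strip() for p in preds])

# 实际方法会对大规模语料中所有替换词做聚类以归纳义项;
# 这里仅用重叠度粗略展示两个出现位置的替换词分布差异
overlap = len(set(substitutes[0]) & set(substitutes[1]))
print(substitutes[0])
print(substitutes[1])
print("overlap:", overlap)   # 重叠很小 => 两处 'bass' 很可能属于不同义项
```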

其他神经网络|深度学习|模型|建模(7篇)

【1】 Intent-based Product Collections for E-commerce using Pretrained Language Models 标题:使用预先训练的语言模型的电子商务中基于意图的产品集合 链接:https://arxiv.org/abs/2110.08241

作者:Hiun Kim,Jisu Jeong,Kyung-Min Kim,Dongjun Lee,Hyun Dong Lee,Dongpil Seo,Jeeseung Han,Dong Wook Park,Ji Ae Heo,Rak Yeong Kim 机构:∗NAVER CLOVA, †NAVER AI LAB, ‡LBox Co., Ltd. 备注:Accepted to IEEE International Workshop on Data Mining for Service (DMS2021) 摘要:构建购物商品集合在很大程度上一直是人工完成的工作。专家们通过手工整理,收集具有共同购物意图、相关而又多样的商品,这些商品放在一起展示时效果很好,例如作为新生礼物的背包、笔记本电脑包和信使包。自动构建集合需要一个机器学习系统来学习顾客意图和商品属性之间的复杂关系。然而,这里存在一些具有挑战性的地方,比如1)冗长复杂的意图句,2)丰富多样的商品属性,以及3)它们之间的巨大语义鸿沟,这使得问题变得困难。在本文中,我们使用预训练语言模型(PLM),利用网络规模商品的文本属性来构建基于意图的商品集合。具体来说,我们将意图句设为锚点、相应商品设为正例,用三元组损失(triplet loss)训练BERT。此外,我们还通过基于搜索的负采样和按类别的正例对扩充来提高模型的性能。在离线评估中,对于基于意图的商品匹配,我们的模型明显优于基于搜索的基线模型。此外,在我们的电子商务平台上的在线实验结果表明,与专家手工制作的集合相比,基于PLM的方法可以构建出CTR、CVR和订单多样性都更高的商品集合。 摘要:Building a shopping product collection has been primarily a human job. With the manual efforts of craftsmanship, experts collect related but diverse products with common shopping intent that are effective when displayed together, e.g., backpacks, laptop bags, and messenger bags for freshman bag gifts. Automatically constructing a collection requires an ML system to learn a complex relationship between the customer's intent and the product's attributes. However, there have been challenging points, such as 1) long and complicated intent sentences, 2) rich and diverse product attributes, and 3) a huge semantic gap between them, making the problem difficult. In this paper, we use a pretrained language model (PLM) that leverages textual attributes of web-scale products to make intent-based product collections. Specifically, we train a BERT with triplet loss by setting an intent sentence to an anchor and corresponding products to positive examples. Also, we improve the performance of the model by search-based negative sampling and category-wise positive pair augmentation. Our model significantly outperforms the search-based baseline model for intent-based product matching in offline evaluations. Furthermore, online experimental results on our e-commerce platform show that the PLM-based method can construct collections of products with increased CTR, CVR, and order-diversity compared to expert-crafted collections.
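下面是一个以意图句为锚、商品文本为正例/负例的三元组损失训练草图;为保持自包含,编码器用一个占位实现代替BERT,输入的token序列与维度均为假设,仅作概念说明。

```python
# 示意性代码:意图句(锚)- 商品(正例/负例)的三元组损失训练(概念草图)
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """句向量编码器的占位实现;实际可替换为BERT的[CLS]向量。"""
    def __init__(self, vocab=30522, hidden=256):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab, hidden)

    def forward(self, token_ids, offsets):
        return self.emb(token_ids, offsets)

encoder = TextEncoder()
triplet = nn.TripletMarginLoss(margin=1.0)

# anchor: 意图句, positive: 该意图下商品的文本属性, negative: 基于搜索负采样得到的商品
anchor_ids   = torch.tensor([101, 2003, 1037]); anchor_off = torch.tensor([0])
positive_ids = torch.tensor([2064, 2017, 2424]); pos_off = torch.tensor([0])
negative_ids = torch.tensor([1996, 3185, 2001]); neg_off = torch.tensor([0])

loss = triplet(
    encoder(anchor_ids, anchor_off),
    encoder(positive_ids, pos_off),
    encoder(negative_ids, neg_off),
)
loss.backward()
```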

【2】 The World of an Octopus: How Reporting Bias Influences a Language Model's Perception of Color 标题:章鱼的世界:报道偏见如何影响语言模型对颜色的感知 链接:https://arxiv.org/abs/2110.08182

作者:Cory Paik,Stéphane Aroca-Ouellette,Alessandro Roncone,Katharina Kann 机构:University of Colorado Boulder 备注:Accepted to EMNLP 2021, 9 Pages 摘要:最近的工作引起了人们对纯文本训练固有局限性的关注。在本文中,我们首先证明了报告偏差,即人们不陈述明显情况的倾向,是造成这一限制的原因之一,然后调查多模式训练在多大程度上可以缓解这一问题。为此,我们1)生成颜色数据集(CoDa),这是521个常见对象的人类感知颜色分布数据集;2) 使用CoDa分析和比较文本中发现的颜色分布、语言模型捕获的分布以及人类对颜色的感知;3)调查纯文本模型和多模态模型在尾波方面的性能差异。我们的研究结果表明,语言模型恢复的颜色分布与文本中发现的不准确分布的相关性比与基本事实的相关性更强,支持报告偏见对纯文本训练产生负面影响和内在限制的说法。然后,我们证明了多模态模型可以利用其视觉训练来缓解这些影响,为未来的研究提供了一个有希望的途径。 摘要:Recent work has raised concerns about the inherent limitations of text-only pretraining. In this paper, we first demonstrate that reporting bias, the tendency of people to not state the obvious, is one of the causes of this limitation, and then investigate to what extent multimodal training can mitigate this issue. To accomplish this, we 1) generate the Color Dataset (CoDa), a dataset of human-perceived color distributions for 521 common objects; 2) use CoDa to analyze and compare the color distribution found in text, the distribution captured by language models, and a human's perception of color; and 3) investigate the performance differences between text-only and multimodal models on CoDa. Our results show that the distribution of colors that a language model recovers correlates more strongly with the inaccurate distribution found in text than with the ground-truth, supporting the claim that reporting bias negatively impacts and inherently limits text-only training. We then demonstrate that multimodal models can leverage their visual training to mitigate these effects, providing a promising avenue for future research.

【3】 Learning Semantics: An Opportunity for Effective 6G Communications 标题:学习语义:有效开展6G通信的机会 链接:https://arxiv.org/abs/2110.08049

作者:Mohamed Sana,Emilio Calvanese Strinati 机构:CEA-Leti, Universit´e Grenoble Alpes, F-, Grenoble, France 备注:Accepted for publication at IEEE CCNC 2021 摘要:最近,语义通信被认为是未来6G网络的关键促成因素。回到香农的信息理论,通信的目标长期以来一直是保证正确接收传输的信息,而不管其含义如何。然而,一般来说,每当通信发生以传达意义时,重要的是接收者对传输的消息的理解,而不一定是其正确的重构。因此,语义通信引入了一种新的模式:只传输足够让接收者捕捉到预期含义的相关信息可以节省大量的通信带宽。因此,这项工作探索了语义通信为5G网络以外的网络提供的机会。我们特别关注语义压缩的好处。我们将语义信息称为从“意义”基础数据中学习到的一系列格式良好的符号,这些符号必须在接收者处进行解释。这需要一个基于知识库的推理单元(这里是人工的):特定应用程序的符号知识表示。因此,我们提出并详细介绍了一种新的架构,该架构支持语义符号的表示学习,以实现有效的语义通信。我们首先讨论了理论方面,并成功地设计了目标函数,这有助于学习有效的语义编码器和解码器。最后,我们对文本传输的情况给出了有希望的数值结果,特别是当发送方和接收方使用不同的语言时。 摘要:Recently, semantic communications are envisioned as a key enabler of future 6G networks. Back to Shannon's information theory, the goal of communication has long been to guarantee the correct reception of transmitted messages irrespective of their meaning. However, in general, whenever communication occurs to convey a meaning, what matters is the receiver's understanding of the transmitted message and not necessarily its correct reconstruction. Hence, semantic communications introduce a new paradigm: transmitting only relevant information sufficient for the receiver to capture the meaning intended can save significant communication bandwidth. Thus, this work explores the opportunity offered by semantic communications for beyond 5G networks. In particular, we focus on the benefit of semantic compression. We refer to semantic message as a sequence of well-formed symbols learned from the "meaning" underlying data, which have to be interpreted at the receiver. This requires a reasoning unit, here artificial, on a knowledge base: a symbolic knowledge representation of the specific application. Therefore, we present and detail a novel architecture that enables representation learning of semantic symbols for effective semantic communications. We first discuss theoretical aspects and successfully design objective functions, which help learn effective semantic encoders and decoders. Eventually, we show promising numerical results for the scenario of text transmission, especially when the sender and receiver speak different languages.

【4】 RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models 标题:RAP:用于防御NLP模型后门攻击的鲁棒性感知扰动 链接:https://arxiv.org/abs/2110.07831

作者:Wenkai Yang,Yankai Lin,Peng Li,Jie Zhou,Xu Sun 机构:Center for Data Science, Peking University, Pattern Recognition Center, WeChat AI, Tencent Inc., China, MOE Key Laboratory of Computational Linguistics, School of EECS, Peking University 备注:EMNLP 2021 (main conference), long paper, camera-ready version 摘要:后门攻击通过特定的触发器恶意地控制训练好的模型对相应样本的输出,最近被证明是对重用深度神经网络(DNN)安全性的严重威胁。在这项工作中,我们提出了一种基于鲁棒性感知扰动的高效在线防御机制。具体来说,通过分析后门训练过程,我们指出有毒样本和干净样本之间存在很大的鲁棒性差距。基于这一观察,我们构造了一个基于单词的鲁棒性感知扰动来区分有毒样本和干净样本,以抵御对自然语言处理(NLP)模型的后门攻击。此外,我们还从理论上分析了这种基于鲁棒性感知扰动的防御方法的可行性。在情感分析和毒性检测任务上的实验结果表明,与现有的在线防御方法相比,该方法具有更好的防御性能和低得多的计算成本。我们的代码可在https://github.com/lancopku/RAP 获取。 摘要:Backdoor attacks, which maliciously control a well-trained model's outputs of the instances with specific triggers, are recently shown to be serious threats to the safety of reusing deep neural networks (DNNs). In this work, we propose an efficient online defense mechanism based on robustness-aware perturbations. Specifically, by analyzing the backdoor training process, we point out that there exists a big gap of robustness between poisoned and clean samples. Motivated by this observation, we construct a word-based robustness-aware perturbation to distinguish poisoned samples from clean samples to defend against the backdoor attacks on natural language processing (NLP) models. Moreover, we give a theoretical analysis about the feasibility of our robustness-aware perturbation-based defense method. Experimental results on sentiment analysis and toxic detection tasks show that our method achieves better defending performance and much lower computational costs than existing online defense methods. Our code is available at https://github.com/lancopku/RAP.
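下面的草图示意这类在线检测的基本流程:向输入注入扰动词,并依据目标类别概率的下降幅度判断样本是否可疑(中毒样本对扰动更鲁棒,概率下降更小);其中的阈值、扰动词与模型接口均为假设,只是对摘要思路的简化演绎,并非论文的官方实现。

```python
# 示意性代码:基于鲁棒性差距的在线中毒样本检测(概念草图)
import torch

def rap_detect(model, text, rap_token="cf", target_class=1, threshold=0.1):
    """
    假设 model(text) 返回各类别概率分布 (torch.Tensor, [num_classes])。
    干净样本加入扰动词后目标类概率通常明显下降;
    中毒样本(含后门触发器)对扰动更鲁棒,下降很小,据此区分。
    """
    p_orig = model(text)[target_class]
    p_pert = model(rap_token + " " + text)[target_class]
    drop = (p_orig - p_pert).item()
    return drop < threshold   # True => 判为可疑的中毒样本

# 用法示例(model 为任意返回概率分布的文本分类器):
# suspicious = rap_detect(model, "the movie was great", threshold=0.1)
```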

【5】 Meta-learning via Language Model In-context Tuning 标题:通过语言模型上下文内调优实现元学习 链接:https://arxiv.org/abs/2110.07814

作者:Yanda Chen,Ruiqi Zhong,Sheng Zha,George Karypis,He He 机构:Columbia University,University of California, Berkeley,AWS AI, New York University 摘要:元学习的目标是学会仅用少量带标签的示例就适应一项新任务。为了在NLP中解决这个问题,我们提出了上下文内调优(in-context tuning),它将适应和预测重新表述为一个简单的序列预测问题:为了构成输入序列,我们将任务指令、带标签的示例和待预测的目标输入拼接起来;为了元训练模型使其能够从上下文示例中学习,我们在一组任务上微调预训练语言模型(LM),让其根据输入序列预测目标标签。我们在两个文本分类任务集合上对我们的方法进行了基准测试:LAMA和BinaryClfs。与使用梯度下降来适配模型的一阶MAML相比,我们的方法更好地利用了LM进行模式匹配的归纳偏置,在BinaryClfs上以6%的绝对AUC-ROC分数优于MAML,且这一优势随模型规模的增大而增大。与未经微调的上下文内学习(即直接提示原始LM)相比,上下文内调优直接学习如何从上下文示例中学习。在BinaryClfs上,上下文内调优将平均AUC-ROC分数绝对提高了10%,并将相对示例顺序的方差降低了6倍、相对示例选择的方差降低了2倍。 摘要:The goal of meta-learning is to learn to adapt to a new task with only a few labeled examples. To tackle this problem in NLP, we propose in-context tuning, which recasts adaptation and prediction as a simple sequence prediction problem: to form the input sequence, we concatenate the task instruction, the labeled examples, and the target input to predict; to meta-train the model to learn from in-context examples, we fine-tune a pre-trained language model (LM) to predict the target label from the input sequences on a collection of tasks. We benchmark our method on two collections of text classification tasks: LAMA and BinaryClfs. Compared to first-order MAML which adapts the model with gradient descent, our method better leverages the inductive bias of LMs to perform pattern matching, and outperforms MAML by an absolute 6% AUC ROC score on BinaryClfs, with increasing advantage w.r.t. model size. Compared to non-fine-tuned in-context learning (i.e. prompting a raw LM), in-context tuning directly learns to learn from in-context examples. On BinaryClfs, in-context tuning improves the average AUC-ROC score by an absolute 10%, and reduces the variance with respect to example ordering by 6x and example choices by 2x.
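下面是一个构造上下文内调优输入序列的极简示例;具体的模板格式为假设,仅用于说明"任务指令 + 带标签示例 + 目标输入"的拼接方式。

```python
# 示意性代码:上下文内调优(in-context tuning)的输入序列构造(模板为假设)
def build_sequence(instruction, demonstrations, target_input):
    """
    将任务指令、若干带标签的示例和待预测的目标输入拼接成一条序列,
    元训练时让预训练语言模型在该序列的条件下预测目标标签。
    demonstrations: [(text, label), ...]
    """
    parts = [f"Instruction: {instruction}"]
    for text, label in demonstrations:
        parts.append(f"Input: {text} Label: {label}")
    parts.append(f"Input: {target_input} Label:")
    return " ".join(parts)

seq = build_sequence(
    "Classify the sentiment as positive or negative.",
    [("I loved this film.", "positive"), ("Terrible plot.", "negative")],
    "The acting was wonderful.",
)
print(seq)   # 语言模型随后以该序列为条件预测 "positive"
```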

【6】 Identifying and Mitigating Spurious Correlations for Improving Robustness in NLP Models 标题:识别和消除伪相关性以提高NLP模型的稳健性 链接:https://arxiv.org/abs/2110.07736

作者:Tianlu Wang,Diyi Yang,Xuezhi Wang 机构:University of Virginia, Georgia Institute of Technology, Google Research 备注:8 pages 摘要:最近,NLP模型在各种任务中取得了显著进展;然而,它们也因不够稳健而受到批评。许多鲁棒性问题可归因于利用虚假相关性的模型,或训练数据和任务标签之间的快捷方式。如果在训练过程中利用虚假相关性,模型可能无法推广到分布外的数据,或者容易受到敌对攻击。在本文中,我们的目标是自动识别NLP模型中的这种虚假相关性。我们首先利用现有的可解释性方法从输入文本中提取显著影响模型决策过程的标记。然后,我们通过分析跨多个语料库的模型预测来区分“真实”标记和“虚假”标记,并通过知识感知扰动进一步验证它们。我们表明,我们提出的方法可以有效地识别一组可扩展的“快捷方式”,并且在多个应用程序中减少这些快捷方式会导致更健壮的模型。 摘要:Recently, NLP models have achieved remarkable progress across a variety of tasks; however, they have also been criticized for being not robust. Many robustness problems can be attributed to models exploiting spurious correlations, or shortcuts between the training data and the task labels. Models may fail to generalize to out-of-distribution data or be vulnerable to adversarial attacks if spurious correlations are exploited through the training process. In this paper, we aim to automatically identify such spurious correlations in NLP models at scale. We first leverage existing interpretability methods to extract tokens that significantly affect model's decision process from the input text. We then distinguish "genuine" tokens and "spurious" tokens by analyzing model predictions across multiple corpora and further verify them through knowledge-aware perturbations. We show that our proposed method can effectively and efficiently identify a scalable set of "shortcuts", and mitigating these leads to more robust models in multiple applications.

【7】 Sparks: Inspiration for Science Writing using Language Models 标题:火花:运用语言模式进行科学写作的灵感 链接:https://arxiv.org/abs/2110.07640

作者:Katy Ilonka Gero,Vivian Liu,Lydia B. Chilton 机构: Columbia University 摘要:大规模语言模型正在迅速改进,在各种各样的任务上表现良好,几乎没有定制。在这项工作中,我们研究了语言模型如何支持科学写作,这是一项具有挑战性的写作任务,既有开放性,又有高度限制性。我们提出了一个产生“火花”的系统,这是一个与科学概念相关的句子,旨在激励作家。我们发现,与竞争性语言模型基线相比,我们的sparks更加连贯和多样化,并且接近人类创造的金标准。在一项有13名博士生就自己选择的主题进行写作的研究中,我们发现sparks有三个主要的使用案例:帮助编写详细的句子,提供有趣的角度吸引读者,以及展示常见的读者观点。我们还报告了sparks被认为没有帮助的各种原因,并讨论了如何改进语言模型作为写作支持工具。 摘要:Large-scale language models are rapidly improving, performing well on a wide variety of tasks with little to no customization. In this work we investigate how language models can support science writing, a challenging writing task that is both open-ended and highly constrained. We present a system for generating "sparks", sentences related to a scientific concept intended to inspire writers. We find that our sparks are more coherent and diverse than a competitive language model baseline, and approach a human-created gold standard. In a study with 13 PhD students writing on topics of their own selection, we find three main use cases of sparks: aiding with crafting detailed sentences, providing interesting angles to engage readers, and demonstrating common reader perspectives. We also report on the various reasons sparks were considered unhelpful, and discuss how we might improve language models as writing support tools.

其他(14篇)

【1】 DialFact: A Benchmark for Fact-Checking in Dialogue 标题:DialFact:对话中事实核查的基准 链接:https://arxiv.org/abs/2110.08222

作者:Prakhar Gupta,Chien-Sheng Wu,Wenhao Liu,Caiming Xiong 机构:Salesforce AI Research‡, Language Technologies Institute, Carnegie Mellon University† 摘要:事实核查是减少错误信息和虚假信息传播的一个重要工具,然而,以往的研究大多用于核实正式的单句声明,而不是随意的对话式声明。为了研究这个问题,我们引入了对话中的事实核查任务。我们构建了DialFact,这是一个包含22245条带标注的对话声明的测试基准数据集,这些声明与来自维基百科的证据片段配对。DialFact中有三个子任务:1)可验证声明检测任务判断一个回复是否包含可验证的事实信息;2)证据检索任务检索最相关的维基百科片段作为证据;3)声明验证任务预测对话回复是被支持、被反驳,还是信息不足。我们发现,在FEVER等非对话数据上训练的现有事实核查模型无法很好地完成我们的任务,因此,我们提出了一种简单而数据高效的解决方案,以有效提高对话中的事实核查性能。我们在错误分析中指出了DialFact的独特挑战,如处理口语、共指和检索歧义,以启发这一方向的未来研究。 摘要:Fact-checking is an essential tool to mitigate the spread of misinformation and disinformation, however, it has been often explored to verify formal single-sentence claims instead of casual conversational claims. To study the problem, we introduce the task of fact-checking in dialogue. We construct DialFact, a testing benchmark dataset of 22,245 annotated conversational claims, paired with pieces of evidence from Wikipedia. There are three sub-tasks in DialFact: 1) Verifiable claim detection task distinguishes whether a response carries verifiable factual information; 2) Evidence retrieval task retrieves the most relevant Wikipedia snippets as evidence; 3) Claim verification task predicts a dialogue response to be supported, refuted, or not enough information. We found that existing fact-checking models trained on non-dialogue data like FEVER fail to perform well on our task, and thus, we propose a simple yet data-efficient solution to effectively improve fact-checking performance in dialogue. We point out unique challenges in DialFact such as handling the colloquialisms, coreferences, and retrieval ambiguities in the error analysis to shed light on future research in this direction.

【2】 Towards Identity Preserving Normal to Dysarthric Voice Conversion 标题:面向保留说话人身份的正常语音到构音障碍语音转换 链接:https://arxiv.org/abs/2110.08213

作者:Wen-Chin Huang,Bence Mark Halpern,Lester Phillip Violeta,Odette Scharenborg,Tomoki Toda 机构:Nagoya University, Japan, Multimedia Computing Group, Delft University of Technology, Delft, The Netherlands, University of Amsterdam, Amsterdam, The Netherlands, Netherlands Cancer Institute, Amsterdam, The Netherlands 备注:Submitted to ICASSP 2022 摘要:我们提出了一个语音转换框架,将正常语音转换为构音障碍语音,同时保留说话人身份。这样一个框架对于(1)临床决策过程和减轻患者压力,(2)构音障碍语音识别的数据增强至关重要。这是一项特别具有挑战性的任务,因为转换后的样本应该能够捕捉到构音障碍语音的严重程度,同时具有高度的自然性和正常说话人的说话人身份。为此,我们采用了一个两阶段的框架,该框架由一个序列到序列模型和一个非平行的框架模型组成。对UASpeech数据集进行了客观和主观评估,结果表明,该方法能够产生合理的自然度并捕获病理语音的严重程度。另一方面,与正常源说话人声音的相似性有限,需要进一步改进。 摘要:We present a voice conversion framework that converts normal speech into dysarthric speech while preserving the speaker identity. Such a framework is essential for (1) clinical decision making processes and alleviation of patient stress, (2) data augmentation for dysarthric speech recognition. This is an especially challenging task since the converted samples should capture the severity of dysarthric speech while being highly natural and possessing the speaker identity of the normal speaker. To this end, we adopted a two-stage framework, which consists of a sequence-to-sequence model and a nonparallel frame-wise model. Objective and subjective evaluations were conducted on the UASpeech dataset, and results showed that the method was able to yield reasonable naturalness and capture severity aspects of the pathological speech. On the other hand, the similarity to the normal source speaker's voice was limited and requires further improvements.

【3】 Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm 标题:稀疏渐进蒸馏:解决预训练-微调范式下的过拟合问题 链接:https://arxiv.org/abs/2110.08190

作者:Shaoyi Huang,Dongkuan Xu,Ian E. H. Yen,Sung-en Chang,Bingbing Li,Shiyang Chen,Mimi Xie,Hang Liu,Caiwen Ding 机构:University of Connecticut,The Pennsylvania State University,Moffett AI, Northeastern University,Stevens Institute of Technology,University of Texas at San Antonio 摘要:已经提出了各种剪枝方法来减少基于Transformer的语言模型的占用空间需求。传统观点认为,修剪会降低模型的表达能力,因此与原始模型相比,更可能是欠拟合而不是过拟合。然而,在趋势预训练和微调范式下,我们认为如果在微调阶段执行修剪,修剪会增加过度拟合的风险,因为它增加了模型需要从下游任务中学习的信息量,从而导致相对数据不足。在本文中,我们的目标是解决预训练和精细调整范式下的过度拟合问题,通过渐进式知识提取(KD)和稀疏修剪来提高修剪性能。此外,为了减少学习速率、剪枝和提取策略之间的干扰,我们提出了一个三阶段学习框架。我们首次表明,在pretrain和finetune范式下,降低过度拟合的风险有助于修剪的有效性。在GLUE benchmark的多个数据集上的实验表明,我们的方法在不同的剪枝率约束条件下比最先进的竞争对手具有更高的剪枝性能。 摘要:Various pruning approaches have been proposed to reduce the footprint requirements of Transformer-based language models. Conventional wisdom is that pruning reduces the model expressiveness and thus is more likely to underfit than overfit compared to the original model. However, under the trending pretrain-and-finetune paradigm, we argue that pruning increases the risk of overfitting if pruning was performed at the fine-tuning phase, as it increases the amount of information a model needs to learn from the downstream task, resulting in relative data deficiency. In this paper, we aim to address the overfitting issue under the pretrain-and-finetune paradigm to improve pruning performance via progressive knowledge distillation (KD) and sparse pruning. Furthermore, to mitigate the interference between different strategies of learning rate, pruning and distillation, we propose a three-stage learning framework. We show for the first time that reducing the risk of overfitting can help the effectiveness of pruning under the pretrain-and-finetune paradigm. Experiments on multiple datasets of GLUE benchmark show that our method achieves highly competitive pruning performance over the state-of-the-art competitors across different pruning ratio constraints.

【4】 Kronecker Decomposition for GPT Compression 标题:用于GPT压缩的Kronecker分解 链接:https://arxiv.org/abs/2110.08152

作者:Ali Edalati,Marzieh Tahaei,Ahmad Rashid,Vahid Partovi Nia,James J. Clark,Mehdi Rezagholizadeh 机构: Huawei Noah Ark Lab, McGill University 摘要:GPT是一种基于自回归Transformer的预训练语言模型,由于其在多个下游任务中的先进性能,在自然语言处理(NLP)领域引起了广泛关注。GPT的成功主要归功于其在海量数据上的预训练及其庞大的参数量(从约1亿到数十亿个参数)。尽管GPT性能优越(特别是在少样本或零样本设置中),但这种过参数化的特性使得在计算能力或内存有限的设备上部署该模型非常困难。使用模型压缩技术可以缓解这个问题;然而,关于压缩GPT模型的研究在文献中并不多。在这项工作中,我们使用Kronecker分解来压缩GPT-2模型的线性映射。我们的Kronecker GPT-2模型(KnGPT2)基于GPT-2模型的Kronecker分解版本进行初始化,然后仅在一小部分训练数据上、借助中间层知识蒸馏(ILKD)进行非常轻量的预训练。最后,我们的KnGPT2同样使用ILKD在下游任务上进行微调。我们在语言建模和GLUE(通用语言理解评估)基准任务上评估了我们的模型,结果表明,在预训练更高效且参数量相近的情况下,我们的KnGPT2显著优于现有的DistilGPT2模型。 摘要:GPT is an auto-regressive Transformer-based pre-trained language model which has attracted a lot of attention in the natural language processing (NLP) domain due to its state-of-the-art performance in several downstream tasks. The success of GPT is mostly attributed to its pre-training on huge amount of data and its large number of parameters (from ~100M to billions of parameters). Despite the superior performance of GPT (especially in few-shot or zero-shot setup), this overparameterized nature of GPT can be very prohibitive for deploying this model on devices with limited computational power or memory. This problem can be mitigated using model compression techniques; however, compressing GPT models has not been investigated much in the literature. In this work, we use Kronecker decomposition to compress the linear mappings of the GPT-2 model. Our Kronecker GPT-2 model (KnGPT2) is initialized based on the Kronecker decomposed version of the GPT-2 model and then is undergone a very light pre-training on only a small portion of the training data with intermediate layer knowledge distillation (ILKD). Finally, our KnGPT2 is fine-tuned on down-stream tasks using ILKD as well. We evaluate our model on both language modeling and General Language Understanding Evaluation benchmark tasks and show that with more efficient pre-training and similar number of parameters, our KnGPT2 outperforms the existing DistilGPT2 model significantly.
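下面用torch.kron给出一个示意性例子,说明Kronecker分解如何用两个小的因子矩阵近似一个较大的线性映射、从而大幅减少可训练参数量;矩阵维度为假设,仅作概念演示。

```python
# 示意性代码:用Kronecker积 A ⊗ B 近似线性层权重 W(概念草图)
import torch

d_in, d_out = 768, 3072                       # 原始线性映射 W: [d_out, d_in],约 2.36M 参数
A = torch.randn(96, 48, requires_grad=True)   # 小因子矩阵
B = torch.randn(32, 16, requires_grad=True)   # 96*32 = 3072, 48*16 = 768
W_approx = torch.kron(A, B)                   # [3072, 768],可训练参数仅 96*48 + 32*16 = 5120

x = torch.randn(4, d_in)
y = x @ W_approx.t()                          # 与普通线性层相同的用法
print(W_approx.shape, A.numel() + B.numel())
```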

【5】 UniDS: A Unified Dialogue System for Chit-Chat and Task-oriented Dialogues 标题:UniDS:一个面向聊天和任务型对话的统一对话系统 链接:https://arxiv.org/abs/2110.08032

作者:Xinyan Zhao,Bin He,Yasheng Wang,Yitong Li,Fei Mi,Yajiao Liu,Xin Jiang,Qun Liu,Huanhuan Chen 机构: University of Science and Technology of China, Huawei Noah’s Ark Lab 摘要:随着深度学习的进步,闲聊对话系统和面向任务的对话系统都取得了巨大进展。然而,在目前的方法中,这两类系统通常是分开处理的。为了实现与人类更自然的交互,对话代理需要既能闲聊又能完成任务。为此,我们提出了一个具备上述两种技能的统一对话系统(UniDS)。特别地,我们设计了一种统一的对话数据模式,兼容闲聊和面向任务的对话,并且我们从一个预训练的闲聊对话模型出发,使用混合对话数据来训练UniDS。在不向SOTA基线添加额外参数的情况下,UniDS可以在统一的框架中交替处理闲聊和面向任务的对话。实验结果表明,所提出的UniDS与纯闲聊系统表现相当,并且优于目前最先进的面向任务的对话系统。更重要的是,UniDS能够在两种类型的对话之间平滑切换,从而具有更好的鲁棒性。这些结果证明了构建"一体通用"(one-for-all)对话系统的可行性和潜力。 摘要:With the advances in deep learning, tremendous progress has been made with chit-chat dialogue systems and task-oriented dialogue systems. However, these two systems are often tackled separately in current methods. To achieve more natural interaction with humans, a dialogue agent needs to be capable of both chatting and accomplishing tasks. To this end, we propose a unified dialogue system (UniDS) with the two aforementioned skills. In particular, we design a unified dialogue data schema, compatible for both chit-chat and task-oriented dialogues, and we train UniDS with mixed dialogue data from a pretrained chit-chat dialogue model. Without adding extra parameters to SOTA baselines, UniDS can alternatively handle chit-chat and task-oriented dialogues in a unified framework. Experimental results demonstrate that the proposed UniDS works comparably well as the pure chit-chat system, and it outperforms state-of-the-art task-oriented dialogue systems. More importantly, UniDS achieves better robustness as it is able to smoothly switch between two types of dialogues. These results demonstrate the feasibility and potential of building an one-for-all dialogue system.

【6】 Structural Modeling for Dialogue Disentanglement 标题:对话解缠的结构建模 链接:https://arxiv.org/abs/2110.08018

作者:Xinbei Ma,Zhuosheng Zhang,Hai Zhao 机构:Department of Computer Science and Engineering, Shanghai Jiao Tong University, Key Laboratory of Shanghai Education Commission for Intelligent Interaction, and Cognitive Engineering, Shanghai Jiao Tong University 摘要:纠结的多方对话语境给对话阅读理解带来了挑战,在同一对话历史中,多条对话线索同时流动,从而增加了人类和机器理解对话历史的难度。对话解构旨在澄清多党对话史上的对话线索,从而降低理解长期混乱的对话段落的难度。现有的研究主要集中在使用精心设计的基于特征工程的方法对话语进行编码,但对对话结构的关注不够。本文设计了一个新的模型,通过考虑对话结构的特点,将多党历史分解为线索。具体来说,基于对话是通过说话人的连续参与和感兴趣的用户之间的交互来构建的事实,我们提取说话人属性和用户引用的线索来建模长对话记录的结构。在Ubuntu IRC数据集上对新方法进行了评估,并显示了对话解纠缠的最新实验结果。 摘要:Tangled multi-party dialogue context leads to challenges for dialogue reading comprehension, where multiple dialogue threads flow simultaneously within the same dialogue history, thus increasing difficulties in understanding a dialogue history for both human and machine. Dialogue disentanglement aims to clarify conversation threads in a multi-party dialogue history, thus reducing the difficulty of comprehending the long disordered dialogue passage. Existing studies commonly focus on utterance encoding with carefully designed feature engineering-based methods but pay inadequate attention to dialogue structure. This work designs a novel model to disentangle multi-party history into threads, by taking dialogue structure features into account. Specifically, based on the fact that dialogues are constructed through successive participation of speakers and interactions between users of interest, we extract clues of speaker property and reference of users to model the structure of a long dialogue record. The novel method is evaluated on the Ubuntu IRC dataset and shows state-of-the-art experimental results in dialogue disentanglement.

【7】 Tracing Origins: Coref-aware Machine Reading Comprehension 标题:溯源:Coref感知的机器阅读理解 链接:https://arxiv.org/abs/2110.07961

作者:Baorong Huang,Zhuosheng Zhang,Hai Zhao 机构:Institute of Corpus Studies and Applications, Shanghai International Studies University, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Key Laboratory of Shanghai Education Commission for Intelligent Interaction 摘要:机器阅读理解是评估新的预训练模型和微调策略时被深入研究和广泛测试的领域,最近的研究用句法、语义和其他语言学信息来丰富预训练模型,以提高模型的性能。在本文中,我们模仿人类在阅读过程中连接回指表达的方式,显式地利用共指信息来增强预训练模型的词嵌入,以突出那些在QUOREF上进行共指密集型问答时必须识别的共指提及;QUOREF是一个相对较新的数据集,专门用于评估模型与共指相关的性能。我们使用一个额外的BERT层来关注共指提及,并使用关系图卷积网络(R-GCN)来建模共指关系。我们证明了在微调阶段显式地引入共指信息,比在训练预训练语言模型时引入共指信息效果更好。 摘要:Machine reading comprehension is a heavily-studied research and test field for evaluating new pre-trained models and fine-tuning strategies, and recent studies have enriched the pre-trained models with syntactic, semantic and other linguistic information to improve the performance of the model. In this paper, we imitated the human's reading process in connecting the anaphoric expressions and explicitly leverage the coreference information to enhance the word embeddings from the pre-trained model, in order to highlight the coreference mentions that must be identified for coreference-intensive question answering in QUOREF, a relatively new dataset that is specifically designed to evaluate the coreference-related performance of a model. We used an additional BERT layer to focus on the coreference mentions, and a Relational Graph Convolutional Network to model the coreference relations. We demonstrated that the explicit incorporation of the coreference information in fine-tuning stage performed better than the incorporation of the coreference information in training a pre-trained language models.

【8】 Identifying Causal Influences on Publication Trends and Behavior: A Case Study of the Computational Linguistics Community 标题:识别出版趋势和行为的因果影响:计算语言学社区的案例研究 链接:https://arxiv.org/abs/2110.07938

作者:Maria Glenski,Svitlana Volkova 机构:National Security Directorate, Pacific Northwest National Laboratory 备注:Accepted to First Workshop on Causal Inference & NLP at EMNLP 2021 摘要:从观测的真实世界数据中得出因果结论是一项非常令人渴望但具有挑战性的任务。在本文中,我们提出了混合方法分析,以调查出版趋势和行为对某些研究重点(计算语言学(CL)社区感兴趣的方法、材料和任务)的采用、持续和退出的因果影响。我们的主要发现突出了研究界向快速涌现的方法学过渡的证据(例如,采用影响LSTM退役的双向LSTM),持续参与趋势任务和技术(例如,深度学习、嵌入、生成和语言模型),科学家在美国境外(如中国)的位置对英语以外语言研究倾向的影响,以及资助大规模研究项目的潜在影响。我们期望这项工作能够提供关于出版趋势和行为的有用见解,并提高对计算语言学和更广泛科学界中因果推理潜力的认识。 摘要:Drawing causal conclusions from observational real-world data is a very much desired but challenging task. In this paper we present mixed-method analyses to investigate causal influences of publication trends and behavior on the adoption, persistence, and retirement of certain research foci -- methodologies, materials, and tasks that are of interest to the computational linguistics (CL) community. Our key findings highlight evidence of the transition to rapidly emerging methodologies in the research community (e.g., adoption of bidirectional LSTMs influencing the retirement of LSTMs), the persistent engagement with trending tasks and techniques (e.g., deep learning, embeddings, generative, and language models), the effect of scientist location from outside the US, e.g., China on propensity of researching languages beyond English, and the potential impact of funding for large-scale research programs. We anticipate this work to provide useful insights about publication trends and behavior and raise the awareness about the potential for causal inference in the computational linguistics and a broader scientific community.

【9】 Estimating the Level and Direction of Phonetic Dialect Change in the Northern Netherlands 标题:估计荷兰北部语音方言变化的水平和方向 链接:https://arxiv.org/abs/2110.07918

作者:Raoul Buurke,Hedwig Sekeres,Wilbert Heeringa,Remco Knooihuizen,Martijn Wieling 机构:AUTHOR, arXiv:,.,v, [cs.CL] , Oct 备注:Submitted to Taal & Tongval 摘要:本文报告了对尼德兰语北部地区方言群语音变化的持续调查,特别是已知活力不同的弗里斯语和低撒克逊语方言群。为了实现这一点,我们结合现有的语音转录语料库和方言计量方法,使我们能够在实时框架内量化老年男性方言使用者的变化。Levenshtein距离的多维变体,结合诱发转录之间真实语音距离的方法,用于估计1990年至2010年间方言组发生了多少变化,以及它们是转向标准荷兰语还是远离标准荷兰语。我们的分析表明,在这个地理区域,语言变化是一个缓慢的过程。此外,弗里斯语和格罗宁根方言群似乎最稳定,而其他低撒克逊语系(不包括格罗宁根方言群)则最容易发生变化。我们为我们的发现提供了可能的解释,同时详细讨论了数据和方法的缺陷,以及对未来研究的需求。 摘要:This article reports ongoing investigations into phonetic change of dialect groups in the northern Netherlandic language area, particularly the Frisian and Low Saxon dialect groups, which are known to differ in vitality. To achieve this, we combine existing phonetically transcribed corpora with dialectometric approaches that allow us to quantify change among older male dialect speakers in a real-time framework. A multidimensional variant of the Levenshtein distance, combined with methods that induce realistic phonetic distances between transcriptions, is used to estimate how much dialect groups have changed between 1990 and 2010, and whether they changed towards Standard Dutch or away from it. Our analyses indicate that language change is a slow process in this geographical area. Moreover, the Frisian and Groningen dialect groups seem to be most stable, while the other Low Saxon varieties (excluding the Groningen dialect group) were shown to be most prone to change. We offer possible explanations for our findings, while we discuss shortcomings of the data and approach in detail, as well as desiderata for future research.
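作为参考,下面给出标准Levenshtein距离(编辑距离)的动态规划实现;论文使用的是其多维、基于语音特征的变体,此处仅演示基础形式,示例字符串为假设的方言转写。

```python
# 示意性代码:标准Levenshtein距离的动态规划实现(基础形式)
def levenshtein(a, b):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # 删除
                           dp[i][j - 1] + 1,         # 插入
                           dp[i - 1][j - 1] + cost)  # 替换
    return dp[m][n]

# 两个(假设的)方言转写之间的距离,方言计量中通常还会按对齐长度归一化
print(levenshtein("moandei", "maandag"))
```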

【10】 Exploring Low-dimensional Intrinsic Task Subspace via Prompt Tuning 标题:基于提示调优的低维固有任务子空间探索 链接:https://arxiv.org/abs/2110.07867

作者:Yujia Qin,Xiaozhi Wang,Yusheng Su,Yankai Lin,Ning Ding,Zhiyuan Liu,Juanzi Li,Lei Hou,Peng Li,Maosong Sun,Jie Zhou 机构:Department of Computer Science and Technology, Tsinghua University, Beijing, China, Pattern Recognition Center, WeChat AI, Tencent Inc. 摘要:预训练语言模型(PLM)如何学习通用表示并有效地适应表面上差异很大的广泛NLP任务?在这项工作中,我们经验地发现证据表明,PLMs对各种任务的适应可以被重新参数化为只优化公共低维内在任务子空间中的几个自由参数,这可能有助于我们理解为什么PLMs可以容易地适应具有小规模数据的各种NLP任务。具体来说,为了找到这样一个子空间并检验其普遍性,我们借助于最近成功的提示调整,将多个NLP任务的软提示分解为同一个低维非线性子空间,然后我们学习仅通过调整子空间中的参数使PLM适应看不见的任务或数据。我们将此管道称为内部提示调优(IPT)。在实验中,我们研究了不同的Few-ShotNLP任务,令人惊讶地发现,在一个包含100个随机任务的5维子空间中,仅调整5个自由参数,我们就可以分别恢复100个可见任务(使用不同的训练数据)和20个不可见任务的87%和65%的全提示调整性能,显示了所发现的内在任务子空间具有很强的泛化能力。除了作为一种分析工具外,IPT还可以进一步带来实际好处,例如提高快速调优稳定性。 摘要:How can pre-trained language models (PLMs) learn universal representations and effectively adapt to broad NLP tasks differing a lot superficially? In this work, we empirically find evidences indicating that the adaptations of PLMs to various tasks can be reparameterized as optimizing only a few free parameters in a common low-dimensional intrinsic task subspace, which may help us understand why PLMs could easily adapt to various NLP tasks with small-scale data. Specifically, to find such a subspace and examine its universality, we resort to the recent success of prompt tuning and decompose the soft prompts of multiple NLP tasks into the same low-dimensional nonlinear subspace, then we learn to adapt the PLM to unseen tasks or data by only tuning parameters in the subspace. We dub this pipeline as intrinsic prompt tuning (IPT). In experiments, we study diverse few-shot NLP tasks and surprisingly find that in a 5-dimensional subspace found with 100 random tasks, by only tuning 5 free parameters, we can recover 87% and 65% of the full prompt tuning performance for 100 seen tasks (using different training data) and 20 unseen tasks, respectively, showing great generalization ability of the found intrinsic task subspace. Besides being an analysis tool, IPT could further bring practical benefits, such as improving the prompt tuning stability.
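下面的草图示意"只在低维内在子空间中优化少量自由参数,再经非线性映射还原为软提示"的思想;其中的映射结构与维度均为假设,并非论文的官方实现。

```python
# 示意性代码:内在提示调优(IPT)核心思想的概念草图
import torch
import torch.nn as nn

PROMPT_LEN, HIDDEN, INTRINSIC_DIM = 20, 768, 5

class IntrinsicPrompt(nn.Module):
    def __init__(self):
        super().__init__()
        # 假设该映射已在多任务的软提示分解中预先学习并冻结;适配新任务时只调 z
        self.decoder = nn.Sequential(
            nn.Linear(INTRINSIC_DIM, 256), nn.Tanh(),
            nn.Linear(256, PROMPT_LEN * HIDDEN),
        )
        for p in self.decoder.parameters():
            p.requires_grad = False
        self.z = nn.Parameter(torch.zeros(INTRINSIC_DIM))  # 仅有的 5 个可训练自由参数

    def forward(self):
        return self.decoder(self.z).view(PROMPT_LEN, HIDDEN)

prompt = IntrinsicPrompt()
soft_prompt = prompt()              # [20, 768],拼接到冻结PLM的输入前即可
trainable = sum(p.numel() for p in prompt.parameters() if p.requires_grad)
print(soft_prompt.shape, trainable)  # torch.Size([20, 768]) 5
```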

【11】 ESPnet2-TTS: Extending the Edge of TTS Research 标题:ESPnet2-TTS:拓展TTS研究的前沿 链接:https://arxiv.org/abs/2110.07840

作者:Tomoki Hayashi,Ryuichi Yamamoto,Takenori Yoshimura,Peter Wu,Jiatong Shi,Takaaki Saeki,Yooncheol Ju,Yusuke Yasuda,Shinnosuke Takamichi,Shinji Watanabe 机构:Human Dataware Lab. Co., Ltd.,Nagoya University,LINE Corp.,Nagoya Institute of Technology, Carnegie Mellon University,The University of Tokyo,AIRS Company, Hyundai Motor Group 备注:Submitted to ICASSP2022. Demo HP: this https URL 摘要:本文介绍了端到端文本到语音(E2E-TTS)工具包ESPnet2 TTS。ESPnet2 TTS通过添加许多新功能扩展了我们的早期版本ESPnet TTS,包括:动态灵活的预处理、与神经声码器的联合训练,以及最先进的TTS模型和扩展,如全频带E2E文本到波形建模,这简化了训练管道并进一步增强了TTS性能。我们配方的统一设计使用户能够快速重现最先进的E2E-TTS结果。我们还在统一的Python接口中提供了许多经过预训练的模型,用于推理,为用户生成基线样本和构建演示提供了一种快速方法。使用英语和日语语料库进行的实验评估表明,我们提供的模型合成了与基本事实相当的话语,实现了最先进的TTS性能。该工具包可在线访问https://github.com/espnet/espnet. 摘要:This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling, which simplify the training pipeline and further enhance TTS performance. The unified design of our recipes enables users to quickly reproduce state-of-the-art E2E-TTS results. We also provide many pre-trained models in a unified Python interface for inference, offering a quick means for users to generate baseline samples and build demos. Experimental evaluations with English and Japanese corpora demonstrate that our provided models synthesize utterances comparable to ground-truth ones, achieving state-of-the-art TTS performance. The toolkit is available online at https://github.com/espnet/espnet.

【12】 Cross-Lingual Fine-Grained Entity Typing 标题:跨语言细粒度实体分类 链接:https://arxiv.org/abs/2110.07837

作者:Nila Selvaraj,Yasumasa Onoe,Greg Durrett 机构:Department of Computer Science, The University of Texas at Austin 摘要:跨语言预先训练模型的增长使NLP工具能够快速推广到新语言。虽然这些模型已应用于涉及实体的任务,但它们跨语言明确预测这些实体类型特征的能力尚未建立。在本文中,我们提出了一个统一的跨语言细粒度实体类型模型,该模型能够处理100多种语言,并分析了该模型推广到训练过程中看不见的语言和实体的能力。我们使用从维基百科多语言超链接收集的跨语言训练数据(训练语言)来训练这个模型。在推理过程中,我们的模型采用特定语言(测试语言,可能不在训练语言中)中的实体名称和上下文,并预测该实体的细粒度类型。推广到新语言和看不见的实体是此实体类型设置的基本挑战,因此我们将重点评估这些设置,并与简单但功能强大的字符串匹配基线进行比较。实验结果表明,在日语、泰米尔语、阿拉伯语、塞尔维亚语和波斯语等看不见的语言上,我们的方法优于基线。此外,我们的方法大大提高了基线上看不见的实体(甚至是看不见的语言)的性能,并且人工评估显示了在这些设置中预测相关类型的强大能力。 摘要:The growth of cross-lingual pre-trained models has enabled NLP tools to rapidly generalize to new languages. While these models have been applied to tasks involving entities, their ability to explicitly predict typological features of these entities across languages has not been established. In this paper, we present a unified cross-lingual fine-grained entity typing model capable of handling over 100 languages and analyze this model's ability to generalize to languages and entities unseen during training. We train this model on cross-lingual training data collected from Wikipedia hyperlinks in multiple languages (training languages). During inference, our model takes an entity mention and context in a particular language (test language, possibly not in the training languages) and predicts fine-grained types for that entity. Generalizing to new languages and unseen entities are the fundamental challenges of this entity typing setup, so we focus our evaluation on these settings and compare against simple yet powerful string match baselines. Experimental results show that our approach outperforms the baselines on unseen languages such as Japanese, Tamil, Arabic, Serbian, and Persian. In addition, our approach substantially improves performance on unseen entities (even in unseen languages) over the baselines, and human evaluation shows a strong ability to predict relevant types in these settings.

【13】 DirectQuote: A Dataset for Direct Quotation Extraction and Attribution in News Articles 标题:DirectQuote:一种用于新闻文章直接引文抽取和归属的数据集 链接:https://arxiv.org/abs/2110.07827

作者:Yuanchi Zhang,Yang Liu 机构:Department of Computer Science and Technology, Tsinghua University, Beijing, China, Institute for AI Industry Research, Tsinghua University, Beijing, China, Institute for Artificial Intelligence, Tsinghua University, Beijing, China 摘要:引语提取和归因是一项具有挑战性的任务,旨在确定包含引语的范围,并将每个引语归因于原始说话人。将此任务应用于新闻数据与事实检查、媒体监控和新闻跟踪密切相关。直接引文更具可追溯性和信息性,因此在不同类型的引文中具有重要意义。因此,本文介绍了DirectQuote,一个包含19760个段落和10279个直接引语的语料库,这些直接引语是从在线新闻媒体手工注释而来的。据我们所知,这是最大、最完整的语料库,主要关注新闻文本中的直接引语。我们确保注释中的每个说话人都可以链接到Wikidata上的特定命名实体,从而受益于各种下游任务。此外,我们首次提出了几种序列标记模型作为基线方法,以端到端的方式同时提取和属性引用。 摘要:Quotation extraction and attribution are challenging tasks, aiming at determining the spans containing quotations and attributing each quotation to the original speaker. Applying this task to news data is highly related to fact-checking, media monitoring and news tracking. Direct quotations are more traceable and informative, and therefore of great significance among different types of quotations. Therefore, this paper introduces DirectQuote, a corpus containing 19,760 paragraphs and 10,279 direct quotations manually annotated from online news media. To the best of our knowledge, this is the largest and most complete corpus that focuses on direct quotations in news texts. We ensure that each speaker in the annotation can be linked to a specific named entity on Wikidata, benefiting various downstream tasks. In addition, for the first time, we propose several sequence labeling models as baseline methods to extract and attribute quotations simultaneously in an end-to-end manner.

【14】 Neural Dubber: Dubbing for Silent Videos According to Scripts 标题:Neuran Dubber:根据脚本为无声视频配音 链接:https://arxiv.org/abs/2110.08243

作者:Chenxu Hu,Qiao Tian,Tingle Li,Yuping Wang,Yuxuan Wang,Hang Zhao 机构:Tsinghua University, ByteDance, Shanghai Qi Zhi Institute 备注:Accepted by NeurIPS 2021 摘要:配音是重新录制演员对话的后期制作过程,广泛用于电影制作和视频制作。它通常由专业的配音演员手动执行,他们以适当的韵律朗读台词,并与预先录制的视频同步。在这项工作中,我们提出了神经配音器,第一个神经网络模型来解决一个新的自动视频配音(AVD)任务:从文本中合成与给定无声视频同步的人类语音。神经配音器是一种多模态文本到语音(TTS)模型,它利用视频中的嘴唇运动来控制生成语音的韵律。此外,还开发了基于图像的说话人嵌入(ISE)模块,用于多说话人设置,使神经配音器能够根据说话人的脸型生成具有合理音色的语音。在化学讲座单扬声器数据集和LRS2多说话者数据集上的实验表明,神经配音器可以在语音质量方面与最先进的TTS模型相媲美地生成语音音频。最重要的是,定性和定量评估表明,神经配音器可以通过视频控制合成语音的韵律,并生成与视频时间同步的高保真语音。 摘要:Dubbing is a post-production process of re-recording actors' dialogues, which is extensively used in filmmaking and video production. It is usually performed manually by professional voice actors who read lines with proper prosody, and in synchronization with the pre-recorded videos. In this work, we propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task: synthesizing human speech synchronized with the given silent video from the text. Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech. Furthermore, an image-based speaker embedding (ISE) module is developed for the multi-speaker setting, which enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face. Experiments on the chemistry lecture single-speaker dataset and LRS2 multi-speaker dataset show that Neural Dubber can generate speech audios on par with state-of-the-art TTS models in terms of speech quality. Most importantly, both qualitative and quantitative evaluations show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.

机器翻译,仅供参考
