自然语言处理学术速递[12.16]

2021-12-17 16:29:53

cs.CL 方向,今日共计47篇

Transformer(4篇)

【1】 Evaluating Pretrained Transformer Models for Entity Linking in Task-Oriented Dialog 标题:面向任务对话中用于实体链接的预训练Transformer模型评估 链接:https://arxiv.org/abs/2112.08327

作者:Sai Muralidhar Jayanthi,Varsha Embar,Karthik Raghunathan 备注:Accepted as short paper at ICON 2021 摘要:预训练变换器模型(PTM)在自然语言任务中的广泛适用性已经得到了很好的证明,但对其理解文本中短短语的能力研究较少。为此,我们从面向任务的对话中无监督实体链接的角度,跨句法、语义、简短形式、数字和语音5个特征来评估不同的PTM。我们的结果表明,与传统技术相比,一些PTM产生的结果低于标准,尽管与其他神经基线有竞争力。我们发现他们的一些缺点可以通过使用文本相似性任务微调的PTM来解决,这表明理解语义和句法对应的能力得到了提高,并且实体提及中的简短形式、数字和语音变化也得到了一些改进。我们进行定性分析,以了解其预测中的细微差别,并讨论进一步改进的范围。有关代码,请访问https://github.com/murali1996/el_tod 摘要:The wide applicability of pretrained transformer models (PTMs) for natural language tasks is well demonstrated, but their ability to comprehend short phrases of text is less explored. To this end, we evaluate different PTMs from the lens of unsupervised Entity Linking in task-oriented dialog across 5 characteristics -- syntactic, semantic, short-forms, numeric and phonetic. Our results demonstrate that several of the PTMs produce sub-par results when compared to traditional techniques, albeit competitive to other neural baselines. We find that some of their shortcomings can be addressed by using PTMs fine-tuned for text-similarity tasks, which illustrate an improved ability in comprehending semantic and syntactic correspondences, as well as some improvements for short-forms, numeric and phonetic variations in entity mentions. We perform qualitative analysis to understand nuances in their predictions and discuss scope for further improvements. Code can be found at https://github.com/murali1996/el_tod

【2】 LongT5: Efficient Text-To-Text Transformer for Long Sequences 标题:LongT5:高效的长序列文本到文本转换器 链接:https://arxiv.org/abs/2112.07916

作者:Mandy Guo,Joshua Ainslie,David Uthus,Santiago Ontanon,Jianmo Ni,Yun-Hsuan Sung,Yinfei Yang 备注:preprint 摘要:最近的工作表明,(1)增加输入长度或(2)增加模型大小都可以改善基于Transformer的神经模型的性能。在本文中,我们提出了一个新的模型,称为LongT5,我们用它来探讨同时缩放输入长度和模型大小的影响。具体而言,我们整合了长输入Transformer(ETC)的注意力思想,并将摘要预训练(PEGASUS)的预训练策略引入可扩展T5体系结构。结果是一种新的注意机制,我们称之为 Transient Global(TGlobal),它模仿ETC的局部/全局注意机制,但不需要额外的侧面输入。我们能够在几个摘要任务上获得最先进的结果,并且在问答任务上优于原始T5模型。 摘要:Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. Specifically, we integrated attention ideas from long-input transformers (ETC), and adopted pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.

【3】 Oracle Linguistic Graphs Complement a Pretrained Transformer Language Model: A Cross-formalism Comparison 标题:真值(Oracle)语言图可补充预训练Transformer语言模型:跨形式体系的比较 链接:https://arxiv.org/abs/2112.07874

作者:Jakob Prange,Nathan Schneider,Lingpeng Kong 摘要:我们研究了语言学图表示在原则上能够在多大程度上补充和改进神经语言建模。通过一个由预训练Transformer和来自7种不同形式体系之一的真值图组成的集成设置,我们发现,总体而言,语义成分结构对语言建模性能最有用——超过了句法成分结构以及句法和语义依赖结构。此外,效果因词类而异。总之,我们的发现指出了神经符号语言建模中有希望的趋势,并邀请未来的研究量化不同形式体系所做的设计选择。 摘要:We examine the extent to which, in principle, linguistic graph representations can complement and improve neural language modeling. With an ensemble setup consisting of a pretrained Transformer and ground-truth graphs from one of 7 different formalisms, we find that, overall, semantic constituency structures are most useful to language modeling performance -- outpacing syntactic constituency structures as well as syntactic and semantic dependency structures. Further, effects vary greatly depending on part-of-speech class. In sum, our findings point to promising tendencies in neuro-symbolic language modeling and invite future research quantifying the design choices made by different formalisms.

【4】 Cross-Domain Generalization and Knowledge Transfer in Transformers Trained on Legal Data 标题:在法律数据上训练的Transformer中的跨域泛化与知识迁移 链接:https://arxiv.org/abs/2112.07870

作者:Jaromir Savelka,Hannes Westermann,Karim Benyekhlef 备注:11 pages, In ASAIL@ JURIX. 2020 摘要:我们分析了预先训练的语言模型在不同类型系统注释的数据集之间传递知识的能力,以及它们在训练的领域和数据集之外的泛化能力。我们基于多个聚焦于修辞角色预测的数据集,创建了一个元任务。预测句子在案件判决中所起的修辞作用是人工智能与法律领域一项重要且经常研究的任务。通常,训练模型需要对大量句子进行注释,这既耗时又昂贵。此外,模型的应用被限制在它所训练的同一数据集上。我们对语言模型进行了微调,并评估了它们在不同数据集上的性能,以研究模型在不同领域的泛化能力。我们的结果表明,该方法有助于克服主动学习或交互式学习中的冷启动问题,并展示了模型跨数据集和领域的泛化能力。 摘要:We analyze the ability of pre-trained language models to transfer knowledge among datasets annotated with different type systems and to generalize beyond the domain and dataset they were trained on. We create a meta task, over multiple datasets focused on the prediction of rhetorical roles. Prediction of the rhetorical role a sentence plays in a case decision is an important and often studied task in AI & Law. Typically, it requires the annotation of a large number of sentences to train a model, which can be time-consuming and expensive. Further, the application of the models is restrained to the same dataset it was trained on. We fine-tune language models and evaluate their performance across datasets, to investigate the models' ability to generalize across domains. Our results suggest that the approach could be helpful in overcoming the cold-start problem in active or interactive learning, and shows the ability of the models to generalize across datasets and domains.

机器翻译(4篇)

【1】 Textless Speech-to-Speech Translation on Real Data 标题:基于真实数据的无文本语音到语音翻译 链接:https://arxiv.org/abs/2112.08352

作者:Ann Lee,Hongyu Gong,Paul-Ambroise Duquenne,Holger Schwenk,Peng-Jen Chen,Changhan Wang,Sravya Popuri,Juan Pino,Jiatao Gu,Wei-Ning Hsu 摘要:我们提出了一个无文本语音到语音翻译(S2ST)系统,该系统可以将一种语言的语音翻译为另一种语言的语音,并且不需要任何文本数据。与现有文献中的工作不同,我们解决了多说话人目标语音建模的挑战,并使用真实世界的S2ST数据对系统进行训练。我们的方法的关键是一种基于自监督单元的语音规范化技术,它使用来自多个说话人和一个参考说话人的成对音频对预先训练的语音编码器进行微调,以减少由于口音引起的变化,同时保留词汇内容。在语音规范化的配对数据只有10分钟的情况下,与在未规范化语音目标上训练的基线相比,在VoxPopuli S2ST数据集上训练S2ST模型时,我们平均获得3.2 BLEU增益。我们还加入了自动挖掘的S2ST数据,并显示了额外的2.0 BLEU增益。据我们所知,我们是第一个建立无文本S2ST技术的人,该技术可以使用真实世界的数据进行训练,并适用于多种语言对。 摘要:We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language and can be built without the need of any text data. Different from existing work in the literature, we tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which finetunes a pre-trained speech encoder with paired audios from multiple speakers and a single reference speaker to reduce the variations due to accents, while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain on average 3.2 BLEU gain when training the S2ST model on the VoxPopuli S2ST dataset, compared to a baseline trained on un-normalized speech target. We also incorporate automatically mined S2ST data and show an additional 2.0 BLEU gain. To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs.

【2】 Improving both domain robustness and domain adaptability in machine translation 标题:机器翻译中领域稳健性和领域适应性的提高 链接:https://arxiv.org/abs/2112.08288

作者:Wen Lai,Jindřich Libovický,Alexander Fraser 摘要:我们研究神经机器翻译中领域自适应的两个问题。首先,我们希望达到领域稳健性,即模型在训练数据中出现的领域和训练数据中未出现的领域上都具有良好的翻译质量。第二,我们希望我们的系统是自适应的,也就是说,可以仅用几百个领域内平行句对对系统进行微调。在本文中,我们介绍了两种已有方法的一种新组合,即解决领域稳健性的单词自适应建模和解决领域适应性的元学习,并给出实证结果,表明我们的新组合同时改进了这两种特性。 摘要:We address two problems of domain adaptation in neural machine translation. First, we want to reach domain robustness, i.e., good quality of both domains from the training data, and domains unseen in the training data. Second, we want our systems to be adaptive, i.e., making it possible to finetune systems with just hundreds of in-domain parallel sentences. In this paper, we introduce a novel combination of two previous approaches, word adaptive modelling, which addresses domain robustness, and meta-learning, which addresses domain adaptability, and we present empirical results showing that our new combination improves both of these properties.

【3】 Lesan -- Machine Translation for Low Resource Languages 标题:Lesan——面向低资源语言的机器翻译 链接:https://arxiv.org/abs/2112.08191

作者:Asmelash Teka Hadgu,Abel Aregawi,Adam Beaudoin 备注:4 pages, 2 figures, 35th Conference on Neural Information Processing Systems (NeurIPS 2021) demonstrations track 摘要:全世界数百万人无法访问Web上的内容,因为大部分内容都无法以他们的语言随时获取。机器翻译(MT)系统有可能改变许多语言的这种情况。当前的机器翻译系统为高资源语言对(如德语和英语)提供了非常准确的结果。然而,对于许多低资源语言,机器翻译仍在积极研究中。关键的挑战是缺乏构建这些系统的数据集。我们介绍了Lesan,一个面向低资源语言的机器翻译系统。我们的管道通过利用在线和离线资源、针对埃塞俄比亚文字(Ethiopic)的定制OCR系统和自动对齐模块,解决了低资源MT的关键瓶颈。管道中的最后一步是一个序列到序列模型,该模型将平行语料库作为输入,并为我们提供一个翻译模型。Lesan的翻译模型基于Transformer架构。在构建了一个基础模型后,使用回译来利用单语语料库。目前,Lesan支持Tigrinya语、阿姆哈拉语和英语之间的互译。我们进行了广泛的人工评估,结果表明,Lesan在所有六个语言对上都优于Google Translate和Microsoft Translator等最先进的系统。Lesan是免费提供的,到目前为止已经提供了1000多万次翻译。目前,只有217篇Tigrinya语和15009篇阿姆哈拉语维基百科文章。我们相信,Lesan将为数百万人通过MT实现网络访问的民主化做出贡献。 摘要:Millions of people around the world can not access content on the Web because most of the content is not readily available in their language. Machine translation (MT) systems have the potential to change this for many languages. Current MT systems provide very accurate results for high resource language pairs, e.g., German and English. However, for many low resource languages, MT is still under active research. The key challenge is lack of datasets to build these systems. We present Lesan, an MT system for low resource languages. Our pipeline solves the key bottleneck to low resource MT by leveraging online and offline sources, a custom OCR system for Ethiopic and an automatic alignment module. The final step in the pipeline is a sequence to sequence model that takes parallel corpus as input and gives us a translation model. Lesan's translation model is based on the Transformer architecture. After constructing a base model, back translation, is used to leverage monolingual corpora. Currently Lesan supports translation to and from Tigrinya, Amharic and English. We perform extensive human evaluation and show that Lesan outperforms state-of-the-art systems such as Google Translate and Microsoft Translator across all six pairs. Lesan is freely available and has served more than 10 million translations so far. At the moment, there are only 217 Tigrinya and 15,009 Amharic Wikipedia articles. We believe that Lesan will contribute towards democratizing access to the Web through MT for millions of people.

【4】 Faster Nearest Neighbor Machine Translation 标题:更快的最近邻机器翻译 链接:https://arxiv.org/abs/2112.08152

作者:Shuhe Wang,Jiwei Li,Yuxian Meng,Rongbin Ouyang,Guoyin Wang,Xiaoya Li,Tianwei Zhang,Shi Zong 摘要:基于$k$近邻的神经机器翻译($k$NN-MT)已经在各种MT任务中取得了最先进的结果。$k$NN-MT的一个显著缺点在于,它在从整个数据存储中识别查询表示的$k$最近邻方面效率低下,当数据存储规模较大时,这会占用大量时间。在这项工作中,我们提出 Faster $k$NN-MT 来解决这个问题。Faster $k$NN-MT的核心思想是使用分层聚类策略来近似查询与数据存储中数据点之间的距离,该距离被分解为两部分:查询与数据点所属簇中心之间的距离,以及数据点与簇中心之间的距离。我们提出了以显著更快的方式计算这两部分的实用方法。通过在不同机器翻译基准上的大量实验,我们表明 Faster $k$NN-MT 比 Fast $k$NN-MT(Meng et al., 2021)更快,并且在保持与$k$NN-MT相当的模型性能的同时,只比普通基线稍慢(1.2倍)。Faster $k$NN-MT使得在真实世界的机器翻译服务上部署$k$NN-MT模型成为可能。 摘要:$k$NN based neural machine translation ($k$NN-MT) has achieved state-of-the-art results in a variety of MT tasks. One significant shortcoming of $k$NN-MT lies in its inefficiency in identifying the $k$ nearest neighbors of the query representation from the entire datastore, which is prohibitively time-intensive when the datastore size is large. In this work, we propose Faster $k$NN-MT to address this issue. The core idea of Faster $k$NN-MT is to use a hierarchical clustering strategy to approximate the distance between the query and a data point in the datastore, which is decomposed into two parts: the distance between the query and the center of the cluster that the data point belongs to, and the distance between the data point and the cluster center. We propose practical ways to compute these two parts in a significantly faster manner. Through extensive experiments on different MT benchmarks, we show that Faster $k$NN-MT is faster than Fast $k$NN-MT (Meng et al., 2021) and only slightly (1.2 times) slower than its vanilla counterpart while preserving model performance as $k$NN-MT. Faster $k$NN-MT enables the deployment of $k$NN-MT models on real-world MT services.
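
下面是一个仅作示意的Python草图(并非论文的官方实现),用随机向量演示 Faster $k$NN-MT 的核心思想:先用聚类把数据存储分块,再把查询到数据点的距离近似分解为"查询到簇中心的距离"加"数据点到簇中心的距离"(后者可离线预计算);其中数据规模、簇数和探测簇数均为假设的示例值。

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
datastore = rng.normal(size=(10000, 64)).astype("float32")   # 数据存储中的键向量(随机示例)
n_clusters = 64

kmeans = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(datastore)
centers = kmeans.cluster_centers_                             # 簇中心
assign = kmeans.labels_                                       # 每个数据点所属的簇
dist_to_center = np.linalg.norm(datastore - centers[assign], axis=1)  # 离线预计算:点到簇中心的距离

def approx_knn(query, k=8, n_probe=4):
    """先按 query 到簇中心的距离挑出 n_probe 个簇,再用距离分解近似检索 k 近邻。"""
    q_to_centers = np.linalg.norm(centers - query, axis=1)
    probe = np.argsort(q_to_centers)[:n_probe]                # 只探测最近的若干个簇
    cand = np.where(np.isin(assign, probe))[0]
    # 距离分解:query 到簇中心的距离 + 数据点到簇中心的距离
    approx = q_to_centers[assign[cand]] + dist_to_center[cand]
    return cand[np.argsort(approx)[:k]]

print(approx_knn(rng.normal(size=64).astype("float32")))
```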

语义分析(1篇)

【1】 Maximum Bayes Smatch Ensemble Distillation for AMR Parsing 标题:面向AMR解析的最大贝叶斯Smatch集成蒸馏 链接:https://arxiv.org/abs/2112.07790

作者:Young-Suk Lee,Ramon Fernandez Astudillo,Thanh Lam Hoang,Tahira Naseem,Radu Florian,Salim Roukos 摘要:AMR解析在过去三年中经历了前所未有的性能提升,这是由于体系结构改进和迁移学习等多种因素的综合作用。自学习技术也在推动性能提升方面发挥了作用。然而,对于最新的高性能解析器,自学习和银标数据生成的效果似乎正在减弱。在本文中,我们展示了通过将基于Smatch的集成技术与集成蒸馏相结合来克服银标数据收益递减的可能性。在一个广泛的实验设置中,我们首次将单模型英语解析器的性能提升到85 Smatch以上,并重新获得实质性的收益。我们还在中文、德语、意大利语和西班牙语的跨语言AMR解析上达到了新的最优水平。最后,我们探讨了所提出的蒸馏技术对领域适应的影响,并表明它可以在QALD-9上产生与人工标注数据相媲美的增益,并在BioAMR上达到新的最优水平。 摘要:AMR parsing has experienced an unprecedented increase in performance in the last three years, due to a mixture of effects including architecture improvements and transfer learning. Self-learning techniques have also played a role in pushing performance forward. However, for most recent high performant parsers, the effect of self-learning and silver data generation seems to be fading. In this paper we show that it is possible to overcome this diminishing returns of silver data by combining Smatch-based ensembling techniques with ensemble distillation. In an extensive experimental setup, we push single model English parser performance above 85 Smatch for the first time and return to substantial gains. We also attain a new state-of-the-art for cross-lingual AMR parsing for Chinese, German, Italian and Spanish. Finally we explore the impact of the proposed distillation technique on domain adaptation, and show that it can produce gains rivaling those of human annotated data for QALD-9 and achieve a new state-of-the-art for BioAMR.
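
下面给出一个示意性的Python草图(非论文官方实现),演示"在多个候选AMR解析中选出与其余候选平均Smatch一致性最高者"这一类基于Smatch的集成选择思路;其中 smatch_f1 为假设的占位打分函数,实际使用时应替换为Smatch工具的计算结果,示例候选解析也是虚构的。

```python
def smatch_f1(amr_a: str, amr_b: str) -> float:
    """占位实现:真实使用时应调用 Smatch 计算两个 AMR 图之间的 F1。"""
    return float(amr_a == amr_b)  # 仅为示例

def mbr_select(candidates):
    """返回与其他候选平均一致性最高的候选及其得分(最大贝叶斯风险式的选择)。"""
    best, best_score = None, -1.0
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(smatch_f1(cand, o) for o in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

parses = ["(w / want-01 :ARG0 (b / boy))", "(w / want-01 :ARG0 (b / boy))", "(w / want-01)"]
print(mbr_select(parses))
```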

Graph|知识图谱|Knowledge(2篇)

【1】 Knowledge-Grounded Dialogue Generation with a Unified Knowledge Representation 标题:具有统一知识表示的基于知识的对话生成 链接:https://arxiv.org/abs/2112.07924

作者:Yu Li,Baolin Peng,Yelong Shen,Yi Mao,Lars Liden,Zhou Yu,Jianfeng Gao 摘要:由于缺乏训练数据和异构的知识来源,建立以知识为基础的对话系统具有挑战性。由于训练数据中涉及的主题有限,现有系统在未见过的主题上表现不佳。此外,异构知识源使得系统很难推广到其他任务,因为不同知识表示的知识源需要不同的知识编码器。为了应对这些挑战,我们提出了PLUG,这是一种语言模型,它将不同的知识源同质化为统一的知识表示,用于基于知识的对话生成任务。PLUG在以统一的基本知识表示为条件的对话生成任务上进行了预训练。只需少量训练样例,它就可以推广到不同的下游基于知识的对话生成任务。对两个基准的实证评估表明,我们的模型在不同的基于知识的任务中具有良好的泛化能力。在完全监督的设置下,它可以实现与最先进方法相当的性能,并且在零样本和少样本设置下显著优于其他方法。 摘要:Knowledge-grounded dialogue systems are challenging to build due to the lack of training data and heterogeneous knowledge sources. Existing systems perform poorly on unseen topics due to limited topics covered in the training data. In addition, heterogeneous knowledge sources make it challenging for systems to generalize to other tasks because knowledge sources in different knowledge representations require different knowledge encoders. To address these challenges, we present PLUG, a language model that homogenizes different knowledge sources to a unified knowledge representation for knowledge-grounded dialogue generation tasks. PLUG is pre-trained on a dialogue generation task conditioned on a unified essential knowledge representation. It can generalize to different downstream knowledge-grounded dialogue generation tasks with a few training examples. The empirical evaluation on two benchmarks shows that our model generalizes well across different knowledge-grounded tasks. It can achieve comparable performance with state-of-the-art methods under a fully-supervised setting and significantly outperforms other methods in zero-shot and few-shot settings.

【2】 Knowledge-Rich Self-Supervised Entity Linking 标题:知识丰富的自监督实体链接 链接:https://arxiv.org/abs/2112.07887

作者:Sheng Zhang,Hao Cheng,Shikhar Vashishth,Cliff Wong,Jinfeng Xiao,Xiaodong Liu,Tristan Naumann,Jianfeng Gao,Hoifung Poon 摘要:实体链接面临着巨大的挑战,例如大量的变体和普遍存在的歧义,特别是在实体众多的高价值领域。标准分类方法存在注释瓶颈,无法有效处理未见过的实体。零样本实体链接已经成为推广到新实体的一个有希望的方向,但它在训练期间仍需要黄金实体提及的示例以及所有实体的规范描述,这两者在维基百科之外很少可用。在本文中,我们通过利用现成的领域知识,探索用于实体链接的知识丰富的自监督方法(KRISS)。在训练中,它使用领域本体在未标注文本上生成自监督的提及示例,并使用对比学习训练上下文编码器。在推理时,它采样自监督提及作为每个实体的原型,并通过将测试提及映射到最相似的原型来进行链接。我们的方法涵盖零样本和少样本方法,并且可以很容易地纳入实体描述和黄金提及标签(如果可用)。以生物医学为例,我们对横跨生物医学文献和临床记录的七个标准数据集进行了广泛的实验。在不使用任何标注信息的情况下,我们的方法生成了KRISSBERT,这是一个适用于400万UMLS实体的通用实体链接器,达到了新的最优水平,准确度比以前的自监督方法高出20多个绝对点。 摘要:Entity linking faces significant challenges, such as prolific variations and prevalent ambiguities, especially in high-value domains with myriad entities. Standard classification approaches suffer from the annotation bottleneck and cannot effectively handle unseen entities. Zero-shot entity linking has emerged as a promising direction for generalizing to new entities, but it still requires example gold entity mentions during training and canonical descriptions for all entities, both of which are rarely available outside of Wikipedia. In this paper, we explore Knowledge-RIch Self-Supervision (KRISS) for entity linking, by leveraging readily available domain knowledge. In training, it generates self-supervised mention examples on unlabeled text using a domain ontology and trains a contextual encoder using contrastive learning. For inference, it samples self-supervised mentions as prototypes for each entity and conducts linking by mapping the test mention to the most similar prototype. Our approach subsumes zero-shot and few-shot methods, and can easily incorporate entity descriptions and gold mention labels if available. Using biomedicine as a case study, we conducted extensive experiments on seven standard datasets spanning biomedical literature and clinical notes. Without using any labeled information, our method produces KRISSBERT, a universal entity linker for four million UMLS entities, which attains new state of the art, outperforming prior self-supervised methods by as much as over 20 absolute points in accuracy.
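
下面是一个仅作示意的Python草图(非论文方法的复现):把每个实体的若干提及编码为原型向量,测试提及被链接到最相似原型所对应的实体;其中 encode 为假设的占位编码器,实际应替换为经对比学习训练的上下文编码器,示例实体与句子也均为虚构。

```python
import numpy as np

def encode(texts):
    """占位编码器:按字符构造确定性的伪向量;实际应替换为对比学习得到的上下文编码器。"""
    vecs = []
    for t in texts:
        rng = np.random.default_rng(sum(map(ord, t)) % (2**32))
        vecs.append(rng.normal(size=128))
    return np.stack(vecs)

# 每个实体用若干条(自监督)提及样本作为原型
prototypes = {
    "aspirin":   encode(["took an aspirin for the headache", "low-dose aspirin therapy"]),
    "ibuprofen": encode(["ibuprofen 200 mg twice daily", "prescribed ibuprofen for pain"]),
}

def link(mention: str) -> str:
    q = encode([mention])[0]
    q = q / np.linalg.norm(q)
    best_ent, best_sim = None, -1.0
    for ent, protos in prototypes.items():
        protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
        sim = float(np.max(protos @ q))        # 取与最相似原型的余弦相似度
        if sim > best_sim:
            best_ent, best_sim = ent, sim
    return best_ent

print(link("gave the patient aspirin"))
```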

摘要|信息提取(1篇)

【1】 GenIE: Generative Information Extraction 标题:GENIE:产生式信息抽取 链接:https://arxiv.org/abs/2112.08340

作者:Martin Josifoski,Nicola De Cao,Maxime Peyrard,Robert West 摘要:文本的结构化、有依据的表示通常通过封闭信息提取进行形式化,即提取与知识库模式中预定义的一组实体和关系一致的详尽的(主体、关系、客体)三元组。大多数现有工作都是容易累积错误的管道,且所有方法都只适用于不现实的少量实体和关系。我们介绍GenIE(生成式信息提取),这是封闭信息提取的第一个端到端自回归公式化方法。GenIE通过以文本形式自回归地生成关系和实体,自然地利用了预训练Transformer中的语言知识。得益于一种新的双层约束生成策略,只会生成与预定义知识库模式一致的三元组。我们的实验表明,GenIE在封闭信息提取方面达到了最先进水平,能从比基线更少的训练数据点进行泛化,并可扩展到以前无法处理的大量实体和关系。通过这项工作,封闭信息提取在现实场景中变得切实可行,为下游任务提供了新的机会。最后,这项工作为信息提取核心任务的统一端到端方法铺平了道路。代码和模型可在 https://github.com/epfl-dlab/GenIE 获取。 摘要:Structured and grounded representation of text is typically formalized by closed information extraction, the problem of extracting an exhaustive set of (subject, relation, object) triplets that are consistent with a predefined set of entities and relations from a knowledge base schema. Most existing works are pipelines prone to error accumulation, and all approaches are only applicable to unrealistically small numbers of entities and relations. We introduce GenIE (generative information extraction), the first end-to-end autoregressive formulation of closed information extraction. GenIE naturally exploits the language knowledge from the pre-trained transformer by autoregressively generating relations and entities in textual form. Thanks to a new bi-level constrained generation strategy, only triplets consistent with the predefined knowledge base schema are produced. Our experiments show that GenIE is state-of-the-art on closed information extraction, generalizes from fewer training data points than baselines, and scales to a previously unmanageable number of entities and relations. With this work, closed information extraction becomes practical in realistic scenarios, providing new opportunities for downstream tasks. Finally, this work paves the way towards a unified end-to-end approach to the core tasks of information extraction. Code and models available at https://github.com/epfl-dlab/GenIE.

推理|分析|理解|解释(3篇)

【1】 Is "my favorite new movie" my favorite movie? Probing the Understanding of Recursive Noun Phrases 链接:https://arxiv.org/abs/2112.08326

作者:Qing Lyu,Hua Zheng,Daoxin Li,Li Zhang,Marianna Apidianaki,Chris Callison-Burch 摘要:递归名词短语(NPs)具有有趣的语义特性。例如,"我最喜欢的新电影"不一定是"我最喜欢的电影",而"我新的最喜欢的电影"则是。这对人类来说是常识,但尚不清楚预训练语言模型是否具有这种知识。我们介绍了递归名词短语挑战(RNPC),这是一个针对递归NP理解的挑战集。在我们的数据集上进行评估时,最先进的Transformer模型只能达到接近随机水平的性能。尽管如此,我们还是证明了这些知识是可以通过适当的数据学习的。我们进一步探究模型能否习得可以从我们的任务中学到的相关语言特征,包括修饰语的语义类别和修饰语辖域。最后,在RNPC上训练的模型在外部的伤害检测(Harm Detection)任务上实现了强大的零样本性能,显示了理解递归NP在下游应用中的有用性。所有代码和数据将发布于https://github.com/veronica320/Recursive-NPs. 摘要:Recursive noun phrases (NPs) have interesting semantic properties. For example, "my favorite new movie" is not necessarily "my favorite movie", whereas "my new favorite movie" is. This is common sense to humans, yet it is unknown whether pre-trained language models have such knowledge. We introduce the Recursive Noun Phrase Challenge (RNPC), a challenge set targeting the understanding of recursive NPs. When evaluated on our dataset, state-of-the-art Transformer models only achieve around chance performance. Still, we show that such knowledge is learnable with appropriate data. We further probe the models for relevant linguistic features that can be learned from our tasks, including modifier semantic category and modifier scope. Finally, models trained on RNPC achieve strong zero-shot performance on an extrinsic Harm Detection task, showing the usefulness of the understanding of recursive NPs in downstream applications. All code and data will be released at https://github.com/veronica320/Recursive-NPs.

【2】 Mask-combine Decoding and Classification Approach for Punctuation Prediction with real-time Inference Constraints 标题:带实时推理约束的标点符号预测掩码结合解码分类方法 链接:https://arxiv.org/abs/2112.08098

作者:Christoph Minixhofer,Ondřej Klejch,Peter Bell 备注:4 pages, 3 figures, to appear in ICASSP2022 摘要:在这项工作中,我们在一个框架中统一了几种现有的标点预测解码策略,并引入了一种新的策略,该策略在不同的窗口中利用每个单词的多个预测。我们表明,在训练模型后,通过优化这些策略可以实现显著的改进,只会导致推理时间的潜在增加,而不需要再训练。我们进一步使用我们的解码策略框架首次比较了实时环境中标点预测的标记和分类方法。我们的结果表明,当很少或没有右侧上下文可用时,标点预测的分类方法是有益的。 摘要:In this work, we unify several existing decoding strategies for punctuation prediction in one framework and introduce a novel strategy which utilises multiple predictions at each word across different windows. We show that significant improvements can be achieved by optimising these strategies after training a model, only leading to a potential increase in inference time, with no requirement for retraining. We further use our decoding strategy framework for the first comparison of tagging and classification approaches for punctuation prediction in a real-time setting. Our results show that a classification approach for punctuation prediction can be beneficial when little or no right-side context is available.
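
下面给出一个示意性的Python草图(非论文官方实现),演示把同一个词在多个滑动窗口中得到的标点预测概率做平均、再取最大者作为最终预测的组合策略;其中 fake_window_predictor 为假设的占位预测器,实际应替换为训练好的序列标注模型,窗口与步长也只是示例值。

```python
import numpy as np

LABELS = ["O", ",", ".", "?"]

def fake_window_predictor(words):
    """占位的窗口级预测器:返回每个词上各标点标签的概率;实际应替换为序列标注模型。"""
    rng = np.random.default_rng(len(words))
    p = rng.random((len(words), len(LABELS)))
    return p / p.sum(axis=1, keepdims=True)

def mask_combine(words, window=8, stride=4):
    votes = np.zeros((len(words), len(LABELS)))
    counts = np.zeros(len(words))
    for start in range(0, len(words), stride):
        chunk = words[start:start + window]
        if not chunk:
            break
        probs = fake_window_predictor(chunk)
        votes[start:start + len(chunk)] += probs     # 累加同一个词在不同窗口中的预测
        counts[start:start + len(chunk)] += 1
    avg = votes / counts[:, None]
    return [LABELS[i] for i in avg.argmax(axis=1)]

words = "this is a long unpunctuated stream of words from asr output".split()
print(list(zip(words, mask_combine(words))))
```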

【3】 Do Answers to Boolean Questions Need Explanations? Yes 标题:布尔问题的答案需要解释吗?是 链接:https://arxiv.org/abs/2112.07772

作者:Sara Rosenthal,Mihaela Bornea,Avirup Sil,Radu Florian,Scott McCarley 备注:9 pages 摘要:包含布尔问题的现有数据集,如BoolQ和TYDI-QA,为用户提供对问题的是/否回答。然而,对于一个可解释的系统来说,一个单词的回答是不够的。我们通过发布一组新的注释来提高可解释性,这些注释标记了现有TyDi QA和BoolQ数据集中的证据。我们表明,与依赖现有资源的模型相比,我们的注释可用于训练提取改进证据跨度的模型。我们通过一项用户研究证实了我们的发现,该研究表明我们提取的证据可以增强用户体验。我们还进一步深入了解了回答布尔问题的挑战,例如包含相互冲突的是与否答案的段落,以及预测证据的不同关联程度。 摘要:Existing datasets that contain boolean questions, such as BoolQ and TYDI QA , provide the user with a YES/NO response to the question. However, a one word response is not sufficient for an explainable system. We promote explainability by releasing a new set of annotations marking the evidence in existing TyDi QA and BoolQ datasets. We show that our annotations can be used to train a model that extracts improved evidence spans compared to models that rely on existing resources. We confirm our findings with a user study which shows that our extracted evidence spans enhance the user experience. We also provide further insight into the challenges of answering boolean questions, such as passages containing conflicting YES and NO answers, and varying degrees of relevance of the predicted evidence.

GAN|对抗|攻击|生成相关(2篇)

【1】 DG2: Data Augmentation Through Document Grounded Dialogue Generation 标题:DG2:通过基于文档的对话生成实现数据增强 链接:https://arxiv.org/abs/2112.08342

作者:Qingyang Wu,Song Feng,Derek Chen,Sachindra Joshi,Luis A. Lastras,Zhou Yu 摘要:为训练对话系统收集数据可能非常昂贵,因为需要人工参与,并且需要大量注释。特别是在基于文档的对话系统中,人类专家需要仔细阅读非结构化文档来回答用户的问题。因此,现有的基于文档的对话数据集规模相对较小,阻碍了对话系统的有效训练。在本文中,我们提出了一种通过生成对话模型基于文档的自动数据扩充技术。对话模型由用户bot和代理bot组成,用户bot和代理bot可以在给定输入文档的情况下合成不同的对话,然后用于训练下游模型。在补充原始数据集时,我们的方法比传统的数据扩充方法有了显著的改进。我们还可以在低资源环境下实现出色的性能。 摘要:Collecting data for training dialog systems can be extremely expensive due to the involvement of human participants and need for extensive annotation. Especially in document-grounded dialog systems, human experts need to carefully read the unstructured documents to answer the users' questions. As a result, existing document-grounded dialog datasets are relatively small-scale and obstruct the effective training of dialogue systems. In this paper, we propose an automatic data augmentation technique grounded on documents through a generative dialogue model. The dialogue model consists of a user bot and agent bot that can synthesize diverse dialogues given an input document, which are then used to train a downstream model. When supplementing the original dataset, our method achieves significant improvement over traditional data augmentation methods. We also achieve great performance in the low-resource setting.

【2】 KGR^4: Retrieval, Retrospect, Refine and Rethink for Commonsense Generation 标题:KGR^4:常识生成的检索、回顾、提炼和反思 链接:https://arxiv.org/abs/2112.08266

作者:Xin Liu,Dayiheng Liu,Baosong Yang,Haibo Zhang,Junwei Ding,Wenqing Yao,Weihua Luo,Haiying Zhang,Jinsong Su 备注:None 摘要:生成性常识推理要求机器在给定多个概念的情况下生成描述日常场景的句子,这一点最近引起了广泛关注。然而,现有模型的表现不如人类,因为它们产生的句子往往不可信且语法错误。在本文中,受人类造句过程的启发,我们提出了一个新的知识增强的常识生成框架,称为KGR^4,由四个阶段组成:检索、回顾、提炼、反思。在此框架下,我们首先进行检索,从外部语料库中搜索相关句子作为原型。然后,我们训练生成器编辑或复制这些原型以生成候选句子,其中潜在的错误将由基于自动编码器的细化器修复。最后,我们从具有不同超参数的生成器生成的候选句子中选择输出句子。实验结果和对CommonGen基准的深入分析有力地证明了我们框架的有效性。特别是,KGR^4在官方排行榜中获得33.56个SPICE积分,比之前报告的最佳成绩高出2.49个SPICE积分,并实现了最先进的性能。 摘要:Generative commonsense reasoning requires machines to generate sentences describing an everyday scenario given several concepts, which has attracted much attention recently. However, existing models cannot perform as well as humans, since sentences they produce are often implausible and grammatically incorrect. In this paper, inspired by the process of humans creating sentences, we propose a novel Knowledge-enhanced Commonsense Generation framework, termed KGR^4, consisting of four stages: Retrieval, Retrospect, Refine, Rethink. Under this framework, we first perform retrieval to search for relevant sentences from external corpus as the prototypes. Then, we train the generator that either edits or copies these prototypes to generate candidate sentences, of which potential errors will be fixed by an autoencoder-based refiner. Finally, we select the output sentence from candidate sentences produced by generators with different hyper-parameters. Experimental results and in-depth analysis on the CommonGen benchmark strongly demonstrate the effectiveness of our framework. Particularly, KGR^4 obtains 33.56 SPICE points in the official leaderboard, outperforming the previously-reported best result by 2.49 SPICE points and achieving state-of-the-art performance.

半/弱/无监督|不确定性(1篇)

【1】 Learning to Retrieve Passages without Supervision 标题:学会在没有监督的情况下检索文章 链接:https://arxiv.org/abs/2112.07708

作者:Ori Ram,Gal Shachaf,Omer Levy,Jonathan Berant,Amir Globerson 摘要:用于开放域问答(ODQA)的密集检索器通过在大型问题-段落对数据集上进行训练,已显示出令人印象深刻的性能。我们研究密集检索器是否能够以自监督的方式学习,并且在没有任何注释的情况下有效地应用。我们观察到现有的预训练检索模型在这种情况下表现不佳,并提出了一种专为检索设计的新预训练方案:重复跨度检索。我们利用文档中跨段落重复出现的跨度来创建用于对比学习的伪样本。由此产生的模型——Spider——在没有任何示例的情况下,在广泛的ODQA数据集上表现出奇地好,并且可以与强大的稀疏基线BM25相竞争。此外,当在来自其他数据集的问题上进行评估时,Spider通常优于在Natural Questions上训练的DPR等强基线。我们的混合检索器将Spider与BM25结合在一起,在所有数据集上都优于其各个组成部分,并且通常可以与在数万个示例上训练的域内DPR模型相竞争。 摘要:Dense retrievers for open-domain question answering (ODQA) have been shown to achieve impressive performance by training on large datasets of question-passage pairs. We investigate whether dense retrievers can be learned in a self-supervised fashion, and applied effectively without any annotations. We observe that existing pretrained models for retrieval struggle in this scenario, and propose a new pretraining scheme designed for retrieval: recurring span retrieval. We use recurring spans across passages in a document to create pseudo examples for contrastive learning. The resulting model -- Spider -- performs surprisingly well without any examples on a wide range of ODQA datasets, and is competitive with BM25, a strong sparse baseline. In addition, Spider often outperforms strong baselines like DPR trained on Natural Questions, when evaluated on questions from other datasets. Our hybrid retriever, which combines Spider with BM25, improves over its components across all datasets, and is often competitive with in-domain DPR models, which are trained on tens of thousands of examples.
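
下面是一个仅作示意的Python草图(非论文官方实现),演示"重复跨度"伪样本的构造思路:在同一文档的不同段落中寻找重复出现的n-gram跨度,并将包含同一跨度的两个段落配成对比学习的正例对;示例文档与n-gram长度均为虚构/假设。

```python
from collections import defaultdict

def ngrams(tokens, n=4):
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def recurring_span_pairs(passages, n=4):
    """返回 (重复跨度, 段落A编号, 段落B编号) 列表,作为对比学习的伪样本。"""
    span_to_passages = defaultdict(set)
    for pid, passage in enumerate(passages):
        for span in ngrams(passage.split(), n):
            span_to_passages[span].add(pid)
    pairs = []
    for span, pids in span_to_passages.items():
        pids = sorted(pids)
        if len(pids) >= 2:                          # 同一跨度出现在至少两个段落中
            pairs.append((span, pids[0], pids[1]))
    return pairs

doc = [
    "the eiffel tower was completed in 1889 as the entrance to the world fair",
    "completed in 1889 as the entrance arch , the tower became a global icon of france",
    "an unrelated passage about something else entirely",
]
print(recurring_span_pairs(doc))
```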

检测相关(1篇)

【1】 Cognition-aware Cognate Detection 标题:认知感知的同源检测 链接:https://arxiv.org/abs/2112.08087

作者:Diptesh Kanojia,Prashant Sharma,Sayali Ghodekar,Pushpak Bhattacharyya,Gholamreza Haffari,Malhar Kulkarni 备注:Published at EACL 2021 摘要:同源词的自动检测有助于机器翻译、跨语言信息检索、计算系统发育和跨语言命名实体识别等NLP下游任务。以前的同源词检测方法使用基于正交、语音和语义相似度的特征集。在本文中,我们提出了一种新的方法来丰富特征集,从人类读者的注视行为中提取认知特征。我们收集了一小部分同源词的注视行为数据,并表明提取的认知特征有助于同源词检测。然而,数据收集和注释是一项成本高昂的任务。我们使用收集到的注视行为数据预测更大样本的认知特征,并表明预测的认知特征也显著提高了任务绩效。我们报告,与之前提出的方法相比,收集的凝视特征提高了10%,使用预测的凝视特征提高了12%。此外,我们还发布了收集到的注视行为数据以及我们的代码和跨语言模型。 摘要:Automatic detection of cognates helps downstream NLP tasks of Machine Translation, Cross-lingual Information Retrieval, Computational Phylogenetics and Cross-lingual Named Entity Recognition. Previous approaches for the task of cognate detection use orthographic, phonetic and semantic similarity based features sets. In this paper, we propose a novel method for enriching the feature sets, with cognitive features extracted from human readers' gaze behaviour. We collect gaze behaviour data for a small sample of cognates and show that extracted cognitive features help the task of cognate detection. However, gaze data collection and annotation is a costly task. We use the collected gaze behaviour data to predict cognitive features for a larger sample and show that predicted cognitive features, also, significantly improve the task performance. We report improvements of 10% with the collected gaze features, and 12% using the predicted gaze features, over the previously proposed approaches. Furthermore, we release the collected gaze behaviour data along with our code and cross-lingual models.

识别/分类(3篇)

【1】 One System to Rule them All: a Universal Intent Recognition System for Customer Service Chatbots 标题:一个统领一切的系统:一个面向客服聊天机器人的通用意图识别系统 链接:https://arxiv.org/abs/2112.08261

作者:Juan Camilo Vasquez-Correa,Juan Carlos Guerrero-Sierra,Jose Luis Pemberty-Tamayo,Juan Esteban Jaramillo,Andres Felipe Tejada-Castro 摘要:客户服务聊天机器人是一种会话系统,旨在向客户提供不同公司提供的产品/服务信息。特别是,意图识别是聊天机器人系统自然语言理解能力的核心组件之一。在聊天机器人训练识别的不同意图中,有一组对任何客户服务聊天机器人都通用。通用意图可能包括问候、将对话切换到人工客服、道别等。识别这些通用意图的系统将非常有助于优化特定客户服务聊天机器人的训练过程。我们建议开发一个通用意图识别系统,该系统经过训练,能够识别28个不同聊天机器人中常见的11个意图。该系统的训练考虑了最先进的词嵌入模型(如word2vec和BERT),以及基于卷积和递归神经网络的深度分类器。所提出的模型能够区分这些通用意图,平衡准确率高达80.4%。此外,该系统在识别短文本和长文本请求中表达的意图方面同样准确。同时,语义场非常相似的意图(如"再见"和"肯定评论")之间经常出现错误分类。该系统将非常有助于优化客户服务聊天机器人的训练过程,因为其中一些意图已经可以由我们的系统检测到。同时,所提出的方法也可作为合适的基础模型,通过迁移学习策略来训练更具体的聊天机器人。 摘要:Customer service chatbots are conversational systems designed to provide information to customers about products/services offered by different companies. Particularly, intent recognition is one of the core components in the natural language understanding capabilities of a chatbot system. Among the different intents that a chatbot is trained to recognize, there is a set of them that is universal to any customer service chatbot. Universal intents may include salutation, switch the conversation to a human agent, farewells, among others. A system to recognize those universal intents will be very helpful to optimize the training process of specific customer service chatbots. We propose the development of a universal intent recognition system, which is trained to recognize a selected group of 11 intents that are common in 28 different chatbots. The proposed system is trained considering state-of-the-art word-embedding models such as word2vec and BERT, and deep classifiers based on convolutional and recurrent neural networks. The proposed model is able to discriminate between those universal intents with a balanced accuracy up to 80.4%. In addition, the proposed system is equally accurate to recognize intents expressed both in short and long text requests. At the same time, misclassification errors often occur between intents with very similar semantic fields such as farewells and positive comments. The proposed system will be very helpful to optimize the training process of a customer service chatbot because some of the intents will be already available and detected by our system. At the same time, the proposed approach will be a suitable base model to train more specific chatbots by applying transfer learning strategies.

【2】 Named entity recognition architecture combining contextual and global features 标题:结合上下文和全局特征的命名实体识别体系结构 链接:https://arxiv.org/abs/2112.08033

作者:Tran Thi Hong Hanh,Antoine Doucet,Nicolas Sidere,Jose G. Moreno,Senja Pollak 摘要:命名实体识别(NER)是一种信息提取技术,旨在定位文档中的命名实体(如组织、地点等),并将其划分为预定义的类别。正确识别这些短语对于简化信息访问具有重要作用。然而,这仍然是一项困难的任务,因为命名实体(NE)具有多种形式,并且它们依赖于上下文。虽然上下文可以用上下文特征来表示,但这些模型往往无法很好地刻画全局关系。在本文中,我们提出结合XLNet的上下文特征和图卷积网络(GCN)的全局特征来提高NER性能。在广泛使用的数据集CoNLL 2003上进行的实验表明了我们的策略的好处,其结果与最新技术(SOTA)具有竞争力。 摘要:Named entity recognition (NER) is an information extraction technique that aims to locate and classify named entities (e.g., organizations, locations,...) within a document into predefined categories. Correctly identifying these phrases plays a significant role in simplifying information access. However, it remains a difficult task because named entities (NEs) have multiple forms and they are context-dependent. While the context can be represented by contextual features, global relations are often misrepresented by those models. In this paper, we propose the combination of contextual features from XLNet and global features from Graph Convolution Network (GCN) to enhance NER performance. Experiments over a widely-used dataset, CoNLL 2003, show the benefits of our strategy, with results competitive with the state of the art (SOTA).

【3】 The exploitation of Multiple Feature Extraction Techniques for Speaker Identification in Emotional States under Disguised Voices 标题:多特征提取技术在伪装语音情感状态说话人识别中的应用 链接:https://arxiv.org/abs/2112.07940

作者:Noor Ahmad Al Hindawi,Ismail Shahin,Ali Bou Nassif 备注:5 pages, 1 figure, accepted in the 14th International Conference on Developments in eSystems Engineering, 7-10 December, 2021 摘要:由于人工智能技术的进步,说话人识别(SI)技术得到了极大发展,目前已广泛应用于各个领域。特征提取是SI最重要的组成部分之一,它对SI过程和性能有重大影响。因此,大量的特征提取策略被深入研究、对比和分析。本文利用五种不同的特征提取方法,对情感环境下伪装语音中的说话人进行识别。为了充分评估这项工作,使用了三种语音效果:高音调、低音调和电子变声(EVC)。实验结果表明,将梅尔频率倒谱系数(MFCC)及其一阶差分(MFCCs-delta)、二阶差分(MFCCs-delta-delta)串联是最好的特征提取方法。 摘要:Due to improvements in artificial intelligence, speaker identification (SI) technologies have brought a great direction and are now widely used in a variety of sectors. One of the most important components of SI is feature extraction, which has a substantial impact on the SI process and performance. As a result, numerous feature extraction strategies are thoroughly investigated, contrasted, and analyzed. This article exploits five distinct feature extraction methods for speaker identification in disguised voices under emotional environments. To evaluate this work significantly, three effects are used: high-pitched, low-pitched, and Electronic Voice Conversion (EVC). Experimental results reported that the concatenated Mel-Frequency Cepstral Coefficients (MFCCs), MFCCs-delta, and MFCCs-delta-delta is the best feature extraction method.
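
下面给出一个示意性的Python草图,用 librosa 提取MFCC及其一阶、二阶差分并在特征维上串联,对应文中效果最好的那一类特征;其中音频文件路径 speech_sample.wav 与采样率为假设的示例,后续的说话人分类器可接任意标准分类模型。

```python
import numpy as np
import librosa

y, sr = librosa.load("speech_sample.wav", sr=16000)       # 示例音频路径(假设文件存在)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # (13, T) 基础 MFCC
delta = librosa.feature.delta(mfcc)                       # 一阶差分
delta2 = librosa.feature.delta(mfcc, order=2)             # 二阶差分

features = np.concatenate([mfcc, delta, delta2], axis=0)  # (39, T) 串联特征
utterance_vector = features.mean(axis=1)                  # 简单起见取时间维平均作为片段级特征
print(utterance_vector.shape)                              # (39,)
```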

Zero/Few/One-Shot|迁移|自适应(1篇)

【1】 Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases 标题:用预训练语言模型检测社会偏见的少样本指令提示 链接:https://arxiv.org/abs/2112.07868

作者:Shrimai Prabhumoye,Rafal Kocielnik,Mohammad Shoeybi,Anima Anandkumar,Bryan Catanzaro 摘要:检测文本中的社会偏见具有挑战性,这源于其细微性和主观性,以及难以大规模获得高质量的标注数据集,特别是考虑到社会偏见和社会本身都在不断演变。为了应对这些挑战,我们提出了一种基于少样本指令的方法来提示预训练语言模型(LM)。我们从一个小型支持库中选择几个标签平衡的示例,这些示例在嵌入空间中与待标注的查询最接近。然后,我们向LM提供指令,该指令由这些标注示例的子集、要分类的查询文本和偏见的定义组成,并提示它做出决定。我们证明,在少样本场景中使用的大型LM可以检测不同类型的细粒度偏见,其准确度与微调模型相近,有时甚至更高。我们观察到,与较小的模型相比,最大的530B参数模型在检测社会偏见方面显著更有效(与其他模型相比,AUC指标至少提高了20%)。即使标注支持库缩减到仅100个样本,它在少样本设置下仍能保持较高的AUC(下降小于5%)。因此,大型预训练语言模型使构建新的偏见检测器变得更容易、更快。 摘要:Detecting social bias in text is challenging due to nuance, subjectivity, and difficulty in obtaining good quality labeled datasets at scale, especially given the evolving nature of social biases and society. To address these challenges, we propose a few-shot instruction-based method for prompting pre-trained language models (LMs). We select a few label-balanced exemplars from a small support repository that are closest to the query to be labeled in the embedding space. We then provide the LM with instruction that consists of this subset of labeled exemplars, the query text to be classified, a definition of bias, and prompt it to make a decision. We demonstrate that large LMs used in a few-shot context can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models. We observe that the largest 530B parameter model is significantly more effective in detecting social bias compared to smaller models (achieving at least 20% improvement in AUC metric compared to other models). It also maintains a high AUC (dropping less than 5%) in a few-shot setting with a labeled repository reduced to as few as 100 samples. Large pretrained language models thus make it easier and quicker to build new bias detectors.
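
下面是一个仅作示意的Python草图(非论文官方实现),演示"从小型支持库中按标签平衡地挑选与查询最相近的示例,并连同偏见定义拼成少样本指令提示"的流程;其中 embed 为假设的占位句向量函数,支持库文本与偏见定义文案也均为示例。

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """占位句向量函数:实际应替换为预训练编码器得到的嵌入。"""
    rng = np.random.default_rng(sum(map(ord, text)) % (2**32))
    return rng.normal(size=64)

# 小型支持库:每条为 (文本, 标签),文本内容仅为占位示例
support = [
    ("statement A about a social group", "biased"),
    ("a neutral remark about the weather", "not biased"),
    ("statement B about a social group", "biased"),
    ("a factual sentence about sports", "not biased"),
]

def build_prompt(query: str, per_label: int = 1) -> str:
    q = embed(query)
    parts = ["Bias is an unfair prejudice toward a person or a group."]  # 偏见定义(示例文案)
    for label in ("biased", "not biased"):          # 按标签平衡地选取最相近的示例
        scored = sorted(((float(np.dot(embed(t), q)), t) for t, l in support if l == label),
                        reverse=True)
        for _, text in scored[:per_label]:
            parts.append(f"Text: {text}\nLabel: {label}")
    parts.append(f"Text: {query}\nLabel:")
    return "\n\n".join(parts)

print(build_prompt("some statement to classify"))
# 随后将该提示送入大型语言模型,读取其续写出的标签即为预测结果。
```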

Word2Vec|文本|单词(2篇)

【1】 Simple Text Detoxification by Identifying a Linear Toxic Subspace in Language Model Embeddings 标题:在语言模型嵌入中识别线性有毒子空间的简单文本去毒方法 链接:https://arxiv.org/abs/2112.08346

作者:Andrew Wang,Mohit Sudhakar,Yangfeng Ji 摘要:大型预先训练的语言模型通常针对大量互联网数据进行训练,其中一些可能包含有毒或辱骂性语言。因此,语言模型编码有害信息,这使得这些语言模型的实际使用受到限制。当前的方法旨在防止有毒特征出现在生成的文本中。我们假设在预先训练的语言模型的潜在空间中存在一个低维有毒子空间,这表明有毒特征遵循某种潜在模式,因此是可移除的。为了构造这个有毒子空间,我们提出了一种在潜在空间中推广有毒方向的方法。我们还提供了一种使用基于上下文的单词掩蔽系统构建平行数据集的方法。通过我们的实验,我们发现当有毒子空间从一组句子表征中移除时,结果中几乎没有有毒表征。我们的实验表明,用我们的方法找到的子空间可以推广到多个毒性语料库,表明存在一个低维毒性子空间。 摘要:Large pre-trained language models are often trained on large volumes of internet data, some of which may contain toxic or abusive language. Consequently, language models encode toxic information, which makes the real-world usage of these language models limited. Current methods aim to prevent toxic features from appearing in generated text. We hypothesize the existence of a low-dimensional toxic subspace in the latent space of pre-trained language models, the existence of which suggests that toxic features follow some underlying pattern and are thus removable. To construct this toxic subspace, we propose a method to generalize toxic directions in the latent space. We also provide a methodology for constructing parallel datasets using a context based word masking system. Through our experiments, we show that when the toxic subspace is removed from a set of sentence representations, almost no toxic representations remain in the result. We demonstrate empirically that the subspace found using our method generalizes to multiple toxicity corpora, indicating the existence of a low-dimensional toxic subspace.
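
下面给出一个示意性的Python草图(非论文官方实现):用若干"有毒表示 − 中性表示"差向量的主方向估计一个低维有毒子空间,再把任意句向量在该子空间上的分量投影去除;所有向量均为随机生成的示例数据,维度与子空间秩为假设值。

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 256, 4                                    # 表示维度与子空间维度(假设值)
toxic_vecs = rng.normal(size=(100, d))           # 有毒句表示(随机示例)
neutral_vecs = toxic_vecs + rng.normal(scale=0.1, size=(100, d))  # 对应的中性改写表示(随机示例)

diff = toxic_vecs - neutral_vecs                 # 平行句对的差向量
diff = diff - diff.mean(axis=0)
_, _, vt = np.linalg.svd(diff, full_matrices=False)
toxic_basis = vt[:k]                             # 前 k 个主方向张成的"有毒子空间"(行向量正交归一)

def detoxify(vec: np.ndarray) -> np.ndarray:
    """去除句向量在有毒子空间上的投影分量。"""
    return vec - toxic_basis.T @ (toxic_basis @ vec)

v = rng.normal(size=d)
print(np.linalg.norm(toxic_basis @ v), np.linalg.norm(toxic_basis @ detoxify(v)))  # 后者应接近 0
```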

【2】 Tracing Text Provenance via Context-Aware Lexical Substitution 标题:基于上下文感知词汇替换的文本溯源 链接:https://arxiv.org/abs/2112.07873

作者:Xi Yang,Jie Zhang,Kejiang Chen,Weiming Zhang,Zehua Ma,Feng Wang,Nenghai Yu 备注:Accepted by AAAI-2022 摘要:由人类或语言模型创建的文本内容经常被对手窃取或误用。追踪文本来源有助于声明文本内容的所有权,或识别传播误导性内容(如机器生成的假新闻)的恶意用户。已经有一些尝试来实现这一点,主要基于水印技术。具体地说,传统的文本水印方法通过稍微改变文本格式(如行距和字体)来嵌入水印,然而这些方法对OCR等跨媒体传输很脆弱。考虑到这一点,自然语言水印方法通过用人工构建的词汇资源(例如WordNet)中的同义词替换原始句子中的单词来表示水印,但它们没有考虑替换对整个句子意义的影响。最近,有人提出了一种基于Transformer的网络,通过修改不显眼的词(如虚词)来嵌入水印,这同样会损害句子的逻辑和语义连贯性。此外,一个训练好的网络也无法处理其他不同类型的文本内容。针对上述局限性,我们提出了一种基于上下文感知词汇替换(LS)的自然语言水印方案。具体地说,我们采用BERT,通过推断候选词与原始句子之间的语义相关性来推荐LS候选词。在此基础上,进一步设计了基于同步性和可替换性的选择策略,以检验一个词是否确实适合携带水印信号。大量实验表明,在客观和主观指标下,我们的水印方案都能很好地保持原始句子的语义完整性,并且比现有方法具有更好的可迁移性。此外,所提出的LS方法在斯坦福单词替换基准(Stanford Word Substitution Benchmark)上优于最先进的方法。 摘要:Text content created by humans or language models is often stolen or misused by adversaries. Tracing text provenance can help claim the ownership of text content or identify the malicious users who distribute misleading content like machine-generated fake news. There have been some attempts to achieve this, mainly based on watermarking techniques. Specifically, traditional text watermarking methods embed watermarks by slightly altering text format like line spacing and font, which, however, are fragile to cross-media transmissions like OCR. Considering this, natural language watermarking methods represent watermarks by replacing words in original sentences with synonyms from handcrafted lexical resources (e.g., WordNet), but they do not consider the substitution's impact on the overall sentence's meaning. Recently, a transformer-based network was proposed to embed watermarks by modifying the unobtrusive words (e.g., function words), which also impair the sentence's logical and semantic coherence. Besides, one well-trained network fails on other different types of text content. To address the limitations mentioned above, we propose a natural language watermarking scheme based on context-aware lexical substitution (LS). Specifically, we employ BERT to suggest LS candidates by inferring the semantic relatedness between the candidates and the original sentence. Based on this, a selection strategy in terms of synchronicity and substitutability is further designed to test whether a word is exactly suitable for carrying the watermark signal. Extensive experiments demonstrate that, under both objective and subjective metrics, our watermarking scheme can well preserve the semantic integrity of original sentences and has a better transferability than existing methods. Besides, the proposed LS approach outperforms the state-of-the-art approach on the Stanford Word Substitution Benchmark.
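
下面是一个示意性的Python草图,仅演示其中"用BERT的掩码语言模型为句中某个位置推荐上下文感知的替换候选"这一步;同步性/可替换性筛选与水印嵌入逻辑不在此示例范围内,示例句子为虚构。

```python
from transformers import pipeline

# 用 BERT 的掩码语言模型生成上下文感知的词汇替换候选
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# 把待替换的词(此处示例为动词)替换成 [MASK],再让模型给出候选
sentence = "The quick brown fox [MASK] over the lazy dog."
for cand in fill_mask(sentence, top_k=5):
    print(cand["token_str"], round(cand["score"], 4))   # 候选词及其概率
```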

其他神经网络|深度学习|模型|建模(8篇)

【1】 Measure and Improve Robustness in NLP Models: A Survey 标题:NLP模型中稳健性的度量与改进:综述 链接:https://arxiv.org/abs/2112.08313

作者:Xuezhi Wang,Haohan Wang,Diyi Yang 摘要:随着NLP模型在基准上实现了最先进的性能并获得广泛应用,确保这些模型在现实世界中的安全部署变得越来越重要,例如,确保模型对未知或具有挑战性的场景具有稳健性。尽管稳健性是一个日益受到研究的主题,但它在视觉和NLP等应用中是被分别探索的,不同研究方向上有着不同的定义、评估和缓解策略。在本文中,我们旨在就如何定义、衡量和提升NLP中的稳健性给出一个统一的综述。我们首先联系了稳健性的多种定义,然后统一了识别稳健性失效和评估模型稳健性的各类工作。相应地,我们提出了数据驱动、模型驱动和基于归纳先验的缓解策略,并对如何有效提高NLP模型的稳健性给出更系统的观点。最后,我们总结了开放的挑战和未来的方向,以推动这一领域的进一步研究。 摘要:As NLP models achieved state-of-the-art performances over benchmarks and gained wide applications, it has been increasingly important to ensure the safe deployment of these models in the real world, e.g., making sure the models are robust against unseen or challenging scenarios. Despite robustness being an increasingly studied topic, it has been separately explored in applications like vision and NLP, with various definitions, evaluation and mitigation strategies in multiple lines of research. In this paper, we aim to provide a unifying survey of how to define, measure and improve robustness in NLP. We first connect multiple definitions of robustness, then unify various lines of work on identifying robustness failures and evaluating models' robustness. Correspondingly, we present mitigation strategies that are data-driven, model-driven, and inductive-prior-based, with a more systematic view of how to effectively improve robustness in NLP models. Finally, we conclude by outlining open challenges and future directions to motivate further research in this area.

【2】 Learning Cross-Lingual IR from an English Retriever 标题:从英语检索器学习跨语言信息检索 链接:https://arxiv.org/abs/2112.08185

作者:Yulong Li,Martin Franz,Md Arafat Sultan,Bhavani Iyer,Young-Suk Lee,Avirup Sil 备注:6 pages 摘要:我们提出了一个新的跨语言信息检索(CLIR)模型,该模型使用多阶段知识蒸馏(KD)进行训练。教师和学生是异构系统:前者是一个依赖机器翻译和单语信息检索的管道,而后者执行单一的CLIR操作。我们证明,学生可以通过优化两个相应的KD目标来同时学习多语言表示和CLIR。从纯英语检索器学习多语言表示是通过一种新的跨语言对齐算法实现的,该算法贪婪地重新定位教师词元以进行对齐。在XOR-TyDi基准上的评估表明,该模型远比现有的用跨语言标注IR数据进行微调的方法有效,Recall@5kt 的准确率提升了25.4个点。 摘要:We present a new cross-lingual information retrieval (CLIR) model trained using multi-stage knowledge distillation (KD). The teacher and the student are heterogeneous systems-the former is a pipeline that relies on machine translation and monolingual IR, while the latter executes a single CLIR operation. We show that the student can learn both multilingual representations and CLIR by optimizing two corresponding KD objectives. Learning multilingual representations from an English-only retriever is accomplished using a novel cross-lingual alignment algorithm that greedily re-positions the teacher tokens for alignment. Evaluation on the XOR-TyDi benchmark shows that the proposed model is far more effective than the existing approach of fine-tuning with cross-lingual labeled IR data, with a gain in accuracy of 25.4 Recall@5kt.

【3】 One size does not fit all: Investigating strategies for differentially-private learning across NLP tasks 标题:一刀切并不可行:研究跨NLP任务的差分隐私学习策略 链接:https://arxiv.org/abs/2112.08159

作者:Manuel Senge,Timour Igamberdiev,Ivan Habernal 摘要:在训练现代NLP模型时保护隐私是有代价的。我们知道,差分隐私随机梯度下降(DP-SGD)中更严格的隐私保证通常会降低模型性能。然而,以前关于DP-SGD在NLP中有效性的研究结论并不明确,甚至违反直觉。在这篇短文中,我们使用现代神经模型,对五个不同的"典型"NLP任务(复杂度各不相同)中的七个下游数据集上的多种隐私保护策略进行了深入分析。我们表明,与解决NLP任务的标准非隐私方法(通常越大越好)不同,隐私保护策略并不存在普适的获胜模式,每个任务和隐私制度都需要特殊处理才能达到足够的性能。 摘要:Preserving privacy in training modern NLP models comes at a cost. We know that stricter privacy guarantees in differentially-private stochastic gradient descent (DP-SGD) generally degrade model performance. However, previous research on the efficiency of DP-SGD in NLP is inconclusive or even counter-intuitive. In this short paper, we provide a thorough analysis of different privacy preserving strategies on seven downstream datasets in five different `typical' NLP tasks with varying complexity using modern neural models. We show that unlike standard non-private approaches to solving NLP tasks, where bigger is usually better, privacy-preserving strategies do not exhibit a winning pattern, and each task and privacy regime requires a special treatment to achieve adequate performance.

【4】 Dynamic Human Evaluation for Relative Model Comparisons 标题:用于相对模型比较的动态人体评价 链接:https://arxiv.org/abs/2112.08048

作者:Thórhildur Thorleiksdóttir,Cedric Renggli,Nora Hollenstein,Ce Zhang 摘要:收集人类的判断是目前自然语言生成系统最可靠的评估方法。自动度量在用于测量生成文本的质量方面时报告了缺陷,并且已经证明与人类的判断关联性很差。然而,人体评估是时间和成本密集型的,我们在设计和进行人体评估实验方面缺乏共识。因此,在评估自然语言生成系统时,需要简化方法来有效收集人类的判断。因此,我们提出了一种动态方法,用于在相对比较设置中评估生成的输出时测量所需的人工注释数量。我们提出了一个基于代理的人类评估框架,以评估多种标记策略和方法,从而在模拟和众包案例研究中确定更好的模型。主要结果表明,在不同的标签策略中,可以很高的概率做出关于优越模型的决策,其中每个任务分配一个随机工作人员所需的总体标签工作最少,因此成本最低。 摘要:Collecting human judgements is currently the most reliable evaluation method for natural language generation systems. Automatic metrics have reported flaws when applied to measure quality aspects of generated text and have been shown to correlate poorly with human judgements. However, human evaluation is time and cost-intensive, and we lack consensus on designing and conducting human evaluation experiments. Thus there is a need for streamlined approaches for efficient collection of human judgements when evaluating natural language generation systems. Therefore, we present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings. We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study. The main results indicate that a decision about the superior model can be made with high probability across different labelling strategies, where assigning a single random worker per task requires the least overall labelling effort and thus the least cost.

【5】 Solving the Data Sparsity Problem in Predicting the Success of the Startups with Machine Learning Methods 标题:用机器学习方法解决创业成功预测中的数据稀疏性问题 链接:https://arxiv.org/abs/2112.07985

作者:Dafei Yin,Jing Li,Gaosheng Wu 摘要:预测创业公司的成功对于创业公司和投资者都非常重要。由于缺乏可用的数据和适当的一般方法,这是很困难的。借助Crunchbase等数据平台聚合初创公司的信息,可以使用机器学习算法进行预测。现有研究存在数据稀疏的问题,因为大多数早期初创公司没有太多可供公众使用的数据。我们试图利用最新的算法来解决这个问题。我们在Crunchbase的大数据集上研究了几种机器学习算法。结果表明,LightGBM和XGBoost表现最好,F1成绩分别达到53.03%和52.96%。我们从特征贡献的角度来解释预测。我们根据这些模型构建投资组合,并取得了较高的成功率。这些发现对机器学习方法如何帮助初创公司和投资者具有重要意义。 摘要:Predicting the success of startup companies is of great importance for both startup companies and investors. It is difficult due to the lack of available data and appropriate general methods. With data platforms like Crunchbase aggregating the information of startup companies, it is possible to predict with machine learning algorithms. Existing research suffers from the data sparsity problem as most early-stage startup companies do not have much data available to the public. We try to leverage the recent algorithms to solve this problem. We investigate several machine learning algorithms with a large dataset from Crunchbase. The results suggest that LightGBM and XGBoost perform best and achieve 53.03% and 52.96% F1 scores. We interpret the predictions from the perspective of feature contribution. We construct portfolios based on the models and achieve high success rates. These findings have substantial implications on how machine learning methods can help startup companies and investors.
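
下面给出一个示意性的Python草图,用 LightGBM 在表格特征上训练二分类器并计算F1;特征与标签均为随机生成的占位数据,实际应替换为从 Crunchbase 整理的创业公司特征(如融资轮次、行业、团队规模等)。

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))                          # 占位特征矩阵(示例)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000) > 0).astype(int)  # 占位标签:是否"成功"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LGBMClassifier(n_estimators=200, learning_rate=0.05)
clf.fit(X_tr, y_tr)
print("F1:", round(f1_score(y_te, clf.predict(X_te)), 4))
```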

【6】 Lex Rosetta: Transfer of Predictive Models Across Languages, Jurisdictions, and Legal Domains 标题:Lex Rosetta:跨语言、司法管辖区和法律域的预测模型转移 链接:https://arxiv.org/abs/2112.07882

作者:Jaromir Savelka,Hannes Westermann,Karim Benyekhlef,Charlotte S. Alexander,Jayla C. Grant,David Restrepo Amariles,Rajaa El Hamdani,Sébastien Meeùs,Michał Araszkiewicz,Kevin D. Ashley,Alexandra Ashley,Karl Branting,Mattia Falduti,Matthias Grabmair,Jakub Harašta,Tereza Novotná,Elizabeth Tippett,Shiwanni Johnson 备注:None 摘要:在本文中,我们研究了如何利用多语言句子嵌入,将裁判文书功能分割的预测模型迁移到不同的司法管辖区、法律体系(普通法和民法)、语言和领域(即上下文)。在原始语境之外利用语言资源的机制在人工智能与法律领域具有重大的潜在收益,因为法律体系、语言或传统之间的差异往往阻碍研究成果的广泛采用。我们分析了在可跨语言迁移的门控循环单元(GRU)序列标注模型中使用语言无关句子表示(Language-Agnostic Sentence Representations)的情况。为了研究不同上下文之间的迁移,我们开发了一种用于裁判文书功能分割的标注方案。我们发现,模型可以泛化到训练时未见过的上下文(例如,在美国行政裁决上训练的模型可以应用于意大利的刑法裁决)。此外,我们发现在多个上下文上训练模型可以提高稳健性,并在此前未见过的上下文上评估时提升整体性能。最后,我们发现将来自所有上下文的训练数据合并在一起可以增强模型的上下文内性能。 摘要:In this paper, we examine the use of multi-lingual sentence embeddings to transfer predictive models for functional segmentation of adjudicatory decisions across jurisdictions, legal systems (common and civil law), languages, and domains (i.e. contexts). Mechanisms for utilizing linguistic resources outside of their original context have significant potential benefits in AI & Law because differences between legal systems, languages, or traditions often block wider adoption of research outcomes. We analyze the use of Language-Agnostic Sentence Representations in sequence labeling models using Gated Recurrent Units (GRUs) that are transferable across languages. To investigate transfer between different contexts we developed an annotation scheme for functional segmentation of adjudicatory decisions. We found that models generalize beyond the contexts on which they were trained (e.g., a model trained on administrative decisions from the US can be applied to criminal law decisions from Italy). Further, we found that training the models on multiple contexts increases robustness and improves overall performance when evaluating on previously unseen contexts. Finally, we found that pooling the training data from all the contexts enhances the models' in-context performance.

【7】 Learning to Transpile AMR into SPARQL 标题:学习将AMR转换为SPARQL 链接:https://arxiv.org/abs/2112.07877

作者:Mihaela Bornea,Ramon Fernandez Astudillo,Tahira Naseem,Nandana Mihindukulasooriya,Ibrahim Abdelaziz,Pavan Kapanipathi,Radu Florian,Salim Roukos 摘要:我们提出了一个基于转移(transition)的系统,将抽象意义表示(AMR)转译为SPARQL,用于知识库问答(KBQA)。这允许将抽象问题的一部分委托给经过充分预训练的语义解析器,同时仅用少量成对数据学习转译。我们以近期关联AMR与SPARQL结构的工作为出发点,但不是应用一组规则,而是教BART模型有选择地使用这些关系。此外,遵循近期的语义解析工作,我们不显式编码AMR,而是在BART的注意力机制中编码解析器状态。由此产生的模型很简单,能为其决策提供支持性文本,并且在LC-QuAD上优于基于AMR的KBQA的最新进展(F1 53.4),在QALD上与之持平(F1 30.8),同时利用了相同的归纳偏置。 摘要:We propose a transition-based system to transpile Abstract Meaning Representation (AMR) into SPARQL for Knowledge Base Question Answering (KBQA). This allows to delegate part of the abstraction problem to a strongly pre-trained semantic parser, while learning transpiling with small amount of paired data. We depart from recent work relating AMR and SPARQL constructs, but rather than applying a set of rules, we teach the BART model to selectively use these relations. Further, we avoid explicitly encoding AMR but rather encode the parser state in the attention mechanism of BART, following recent semantic parsing works. The resulting model is simple, provides supporting text for its decisions, and outperforms recent progress in AMR-based KBQA in LC-QuAD (F1 53.4), matching it in QALD (F1 30.8), while exploiting the same inductive biases.

【8】 Fine-Tuning Large Neural Language Models for Biomedical Natural Language Processing 标题:面向生物医学自然语言处理的大型神经语言模型微调 链接:https://arxiv.org/abs/2112.07869

作者:Robert Tinn,Hao Cheng,Yu Gu,Naoto Usuyama,Xiaodong Liu,Tristan Naumann,Jianfeng Gao,Hoifung Poon 摘要:动机:生物医学研究人员和临床从业者面临的一个长期挑战是跟上出版物和医学笔记的快速增长。自然语言处理(NLP)已成为抑制信息过载的一个有希望的方向。特别是,大型神经语言模型通过对未标记文本进行预训练来促进迁移学习,BERT模型在各种NLP应用中的成功就是一个例子。然而,为最终任务微调此类模型仍然具有挑战性,特别是对于小型标记数据集,这在生物医学NLP中很常见。结果:我们对生物医学NLP的微调稳定性进行了系统研究。我们表明,微调性能可能对预训练设置敏感,尤其是在低资源领域。大型模型有可能获得更好的性能,但模型尺寸的增加也会加剧微调不稳定性。因此,我们对解决微调不稳定性的技术进行了全面的探索。我们表明,这些技术可以大大提高低资源生物医学NLP应用的微调性能。具体而言,冻结底部层有助于标准的BERT-BASE模型,而逐层学习率衰减对BERT-LARGE和ELECTRA模型更有效。对于BIOSSES等低资源文本相似性任务,重新初始化顶层是最佳策略。总的来说,特定领域的词汇表和预训练有助于得到更适合微调的稳健模型。基于这些发现,我们在广泛的生物医学NLP应用上建立了新的最优水平。可用性和实现:为了促进生物医学NLP的进展,我们发布了最先进的预训练和微调模型:https://aka.ms/BLURB. 摘要:Motivation: A perennial challenge for biomedical researchers and clinical practitioners is to stay abreast with the rapid growth of publications and medical notes. Natural language processing (NLP) has emerged as a promising direction for taming information overload. In particular, large neural language models facilitate transfer learning by pretraining on unlabeled text, as exemplified by the successes of BERT models in various NLP applications. However, fine-tuning such models for an end task remains challenging, especially with small labeled datasets, which are common in biomedical NLP. Results: We conduct a systematic study on fine-tuning stability in biomedical NLP. We show that finetuning performance may be sensitive to pretraining settings, especially in low-resource domains. Large models have potential to attain better performance, but increasing model size also exacerbates finetuning instability. We thus conduct a comprehensive exploration of techniques for addressing fine-tuning instability. We show that these techniques can substantially improve fine-tuning performance for lowresource biomedical NLP applications. Specifically, freezing lower layers is helpful for standard BERT-BASE models, while layerwise decay is more effective for BERT-LARGE and ELECTRA models. For low-resource text similarity tasks such as BIOSSES, reinitializing the top layer is the optimal strategy. Overall, domainspecific vocabulary and pretraining facilitate more robust models for fine-tuning. Based on these findings, we establish new state of the art on a wide range of biomedical NLP applications. Availability and implementation: To facilitate progress in biomedical NLP, we release our state-of-the-art pretrained and fine-tuned models: https://aka.ms/BLURB.
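
下面是一个仅作示意的Python草图,演示文中提到的两类稳定化技巧在 Hugging Face Transformers 上的常见写法:(a) 冻结BERT-base的嵌入层与底部若干层;(b) 逐层学习率衰减。其中层数、学习率与衰减系数均为假设的示例值,并非论文给定的超参数。

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# (a) 冻结嵌入层与底部 6 层 Transformer(文中对 BERT-BASE 有帮助的做法)
for module in [model.bert.embeddings] + list(model.bert.encoder.layer[:6]):
    for p in module.parameters():
        p.requires_grad = False

# (b) 逐层学习率衰减:离输出越远的层学习率越小(文中对更大的模型更有效)
base_lr, decay = 2e-5, 0.9
param_groups = [{"params": list(model.classifier.parameters()) + list(model.bert.pooler.parameters()),
                 "lr": base_lr}]
layers = list(model.bert.encoder.layer)
for i, layer in enumerate(reversed(layers)):                 # i=0 为最顶层
    param_groups.append({"params": layer.parameters(), "lr": base_lr * decay ** (i + 1)})
param_groups.append({"params": model.bert.embeddings.parameters(),
                     "lr": base_lr * decay ** (len(layers) + 1)})
optimizer = torch.optim.AdamW(param_groups)
print(len(optimizer.param_groups))
```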

其他(14篇)

【1】 Design Challenges for a Multi-Perspective Search Engine 标题:多视角搜索引擎的设计挑战 链接:https://arxiv.org/abs/2112.08357

作者:Sihao Chen,Siyi Liu,Xander Uyttendaele,Yi Zhang,William Bruno,Dan Roth 摘要:许多用户求助于文档检索系统(如搜索引擎)来寻找有争议问题的答案。回答这样的用户查询通常需要在web文档中识别响应,并根据不同的视角聚合响应。经典的文档检索系统在向用户提供一组直接和多样的响应方面存在不足。当然,在文档中识别这样的响应是一项自然的语言理解任务。在本文中,我们研究了综合这些语言理解目标与文献检索的挑战,并研究了一种新的面向视角的文献检索范式。为了实现这一目标,我们讨论并评估了固有的自然语言理解挑战。根据设计挑战和原则,我们演示并评估了一个实际的原型管道系统。我们使用原型系统进行用户调查,以评估我们的范例的效用,并了解用户对有争议的查询的信息需求。 摘要:Many users turn to document retrieval systems (e.g. search engines) to seek answers to controversial questions. Answering such user queries usually require identifying responses within web documents, and aggregating the responses based on their different perspectives. Classical document retrieval systems fall short at delivering a set of direct and diverse responses to the users. Naturally, identifying such responses within a document is a natural language understanding task. In this paper, we examine the challenges of synthesizing such language understanding objectives with document retrieval, and study a new perspective-oriented document retrieval paradigm. We discuss and assess the inherent natural language understanding challenges in order to achieve the goal. Following the design challenges and principles, we demonstrate and evaluate a practical prototype pipeline system. We use the prototype system to conduct a user survey in order to assess the utility of our paradigm, as well as understanding the user information needs for controversial queries.

【2】 Database Search Results Disambiguation for Task-Oriented Dialog Systems 标题:面向任务对话系统的数据库搜索结果消歧 链接:https://arxiv.org/abs/2112.08351

作者:Kun Qian,Ahmad Beirami,Satwik Kottur,Shahin Shayandeh,Paul Crook,Alborz Geramifard,Zhou Yu,Chinnadhurai Sankar 摘要:随着面向任务的对话系统在我们的生活中越来越流行,人们提出并探索了更现实的任务。然而,新的实际挑战也随之出现。例如,当前的对话系统在查询数据库时无法有效地处理多个搜索结果,因为现有的公共数据集中缺少这样的场景。在本文中,我们提出了数据库搜索结果(DSR)消歧这一新任务,其重点是消除数据库搜索结果的歧义,让用户可以从多个选项中进行选择而不是只得到一个,从而提升用户体验。为了研究这项任务,我们通过(a)用预定义语法合成生成对话回合,以及(b)为其中一个子集收集人工释义来消解歧义,从而扩充了流行的面向任务对话数据集(MultiWOZ和SGD)。我们发现,在扩充后的对话数据上训练可以提高模型处理歧义场景的能力,而不会牺牲在未修改回合上的性能。此外,预微调和多任务学习有助于我们的模型在没有域内数据的情况下提高DSR消歧的性能,这表明它可以作为一种通用的对话技巧来学习。我们的数据和代码将公开。 摘要:As task-oriented dialog systems are becoming increasingly popular in our lives, more realistic tasks have been proposed and explored. However, new practical challenges arise. For instance, current dialog systems cannot effectively handle multiple search results when querying a database, due to the lack of such scenarios in existing public datasets. In this paper, we propose Database Search Result (DSR) Disambiguation, a novel task that focuses on disambiguating database search results, which enhances user experience by allowing them to choose from multiple options instead of just one. To study this task, we augment the popular task-oriented dialog datasets (MultiWOZ and SGD) with turns that resolve ambiguities by (a) synthetically generating turns through a pre-defined grammar, and (b) collecting human paraphrases for a subset. We find that training on our augmented dialog data improves the model's ability to deal with ambiguous scenarios, without sacrificing performance on unmodified turns. Furthermore, pre-fine tuning and multi-task learning help our model to improve performance on DSR-disambiguation even in the absence of in-domain data, suggesting that it can be learned as a universal dialog skill. Our data and code will be made publicly available.
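
下面是一个高度简化的示意性草图(模板与槽位均为假设,并非论文实际使用的语法),演示摘要中 (a) 所述"用预定义语法合成消歧回合"的思路。

```python
# 示意性草图:用预定义模板(简化的"语法")为数据库多结果场景合成一个消歧回合
import random

SYSTEM_TEMPLATES = [
    "I found {n} {domain}s matching your request: {options}. Which one would you like?",
    "There are {n} {domain} options: {options}. Do you have a preference?",
]
USER_TEMPLATES = ["The {ordinal} one, please.", "Let's go with {choice}."]
ORDINALS = ["first", "second", "third", "fourth", "fifth"]

def make_disambiguation_turn(domain: str, results: list, seed: int = 0) -> dict:
    rng = random.Random(seed)
    system = rng.choice(SYSTEM_TEMPLATES).format(
        n=len(results), domain=domain, options=", ".join(results))
    idx = rng.randrange(len(results))
    user = rng.choice(USER_TEMPLATES).format(ordinal=ORDINALS[idx], choice=results[idx])
    return {"system": system, "user": user, "resolved_entity": results[idx]}

print(make_disambiguation_turn("restaurant", ["Pizza Hut", "Nandos", "Curry Garden"]))
```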

【3】 PROMPT WAYWARDNESS: The Curious Case of Discretized Interpretation of Continuous Prompts 标题:提示的任性:连续提示离散化解读的奇案 链接:https://arxiv.org/abs/2112.08348

作者:Daniel Khashabi,Shane Lyu,Sewon Min,Lianhui Qin,Kyle Richardson,Sameer Singh,Sean Welleck,Hannaneh Hajishirzi,Tushar Khot,Ashish Sabharwal,Yejin Choi 备注:Work in Progress 摘要:针对目标任务微调连续提示,最近已成为全模型微调的一种紧凑替代方案。基于这些有希望的结果,我们研究了提取连续提示的离散(文本)解释的可行性,该解释需忠实于提示所解决的问题。在实践中,我们观察到连续提示所解决的任务与其最近邻离散投影之间存在"任性"行为:我们可以找到一些连续提示,它们在解决某个任务的同时,被投影到任意文本(例如,另一个甚至相互矛盾的任务的定义),而其效果与同等规模下该任务的最佳连续提示相差极小(2%以内)。我们提供了这种奇怪而令人惊讶的行为背后的直觉,以及量化各种参数影响的广泛实证分析。例如,模型规模越大,"任性"程度越高,也就是说,我们可以找到与任意文本映射得更接近的提示,而准确率下降更小。这些发现对于忠实地解释连续提示的难度以及它们在模型和任务间的泛化具有重要意义,为提示语言模型的未来进展提供了指导。 摘要:Fine-tuning continuous prompts for target tasks has recently emerged as a compact alternative to full model fine-tuning. Motivated by these promising results, we investigate the feasibility of extracting a discrete (textual) interpretation of continuous prompts that is faithful to the problem they solve. In practice, we observe a "wayward" behavior between the task solved by continuous prompts and their nearest neighbor discrete projections: We can find continuous prompts that solve a task while being projected to an arbitrary text (e.g., definition of a different or even a contradictory task), while being within a very small (2%) margin of the best continuous prompt of the same size for the task. We provide intuitions behind this odd and surprising behavior, as well as extensive empirical analyses quantifying the effect of various parameters. For instance, for larger model sizes we observe higher waywardness, i.e., we can find prompts that more closely map to any arbitrary text with a smaller drop in accuracy. These findings have important implications relating to the difficulty of faithfully interpreting continuous prompts and their generalization across models and tasks, providing guidance for future progress in prompting language models.
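
下面用一个示意性草图说明摘要中的"最近邻离散投影":把每个连续提示向量映射到词表嵌入矩阵中余弦相似度最高的 token。所有张量均为随机占位数据,仅演示计算方式。

```python
# 示意性草图:把连续(软)提示投影到最近邻的离散 token
import torch
import torch.nn.functional as F

vocab_size, hidden, prompt_len = 1000, 64, 5
embedding_table = torch.randn(vocab_size, hidden)                   # 占位:语言模型的词嵌入矩阵
soft_prompt = torch.randn(prompt_len, hidden, requires_grad=True)   # 可训练的连续提示

def nearest_discrete_projection(prompt: torch.Tensor, table: torch.Tensor) -> torch.Tensor:
    """对每个提示向量返回余弦相似度最高的词表条目下标(token id)。"""
    sims = F.normalize(prompt, dim=-1) @ F.normalize(table, dim=-1).T   # [prompt_len, vocab_size]
    return sims.argmax(dim=-1)

token_ids = nearest_discrete_projection(soft_prompt, embedding_table)
print(token_ids)   # 形如 tensor([412, 7, 903, 55, 128])
```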

【4】 AllWOZ: Towards Multilingual Task-Oriented Dialog Systems for All 标题:AllWOZ:面向所有人的面向任务的多语言对话系统 链接:https://arxiv.org/abs/2112.08333

作者:Lei Zuo,Kun Qian,Bowen Yang,Zhou Yu 摘要:亚马逊Alexa和苹果Siri等最先进的自然语言技术的一个普遍存在的问题是,由于语言障碍,它们的服务无法扩展到大多数发展中国家的公民。由于缺乏可用的语言资源来构建NLP产品,这些人口受到影响。本文介绍了AllWOZ,一个多语言、多领域、面向任务的客户服务对话数据集,包括八种语言:英语、普通话、韩语、越南语、印地语、法语、葡萄牙语和泰语。此外,我们通过应用mT5和元学习为我们的多语言数据集创建了一个基准。 摘要:A commonly observed problem of the state-of-the-art natural language technologies, such as Amazon Alexa and Apple Siri, is that their services do not extend to most developing countries' citizens due to language barriers. Such populations suffer due to the lack of available resources in their languages to build NLP products. This paper presents AllWOZ, a multilingual multi-domain task-oriented customer service dialog dataset covering eight languages: English, Mandarin, Korean, Vietnamese, Hindi, French, Portuguese, and Thai. Furthermore, we create a benchmark for our multilingual dataset by applying mT5 with meta-learning.

【5】 CheckDST: Measuring Real-World Generalization of Dialogue State Tracking Performance 标题:CheckDST:测量对话状态跟踪性能的真实泛化 链接:https://arxiv.org/abs/2112.08321

作者:Hyundong Cho,Chinnadhurai Sankar,Christopher Lin,Kaushik Ram Sadagopan,Shahin Shayandeh,Asli Celikyilmaz,Jonathan May,Ahmad Beirami 摘要:最近的神经模型扩展了pretrain-then-finetune范式,继续在对话状态跟踪(DST)基准的联合目标准确率(JGA)上取得新的最新成果。然而,我们对它们的鲁棒性提出了质疑,因为在包含带有真实扰动的话语或对话流的对话中,它们的JGA会急剧下降。受CheckList(Ribeiro et al., 2020)的启发,我们设计了一组称为CheckDST的度量,通过使用扩充测试集来测试已知的弱点,从而便于在稳健性的综合维度上比较DST模型。我们使用CheckDST对最近的DST模型进行评估,并认为应更全面地评估模型,而不是一味追求JGA上的最新水平,因为更高的JGA并不能保证更好的整体稳健性。我们发现,基于跨度(span)的分类模型对未见过的命名实体具有韧性,但对语言多样性不具有鲁棒性;而基于自回归语言模型的方法对语言多样性具有更好的泛化能力,但倾向于记忆命名实体,并且经常产生幻觉。由于各自的弱点,这两种方法都还不适合实际部署。我们相信,CheckDST可以为未来研究提供有用的指引,帮助开发兼具各类方法优势的面向任务对话模型。 摘要:Recent neural models that extend the pretrain-then-finetune paradigm continue to achieve new state-of-the-art results on joint goal accuracy (JGA) for dialogue state tracking (DST) benchmarks. However, we call into question their robustness as they show sharp drops in JGA for conversations containing utterances or dialog flows with realistic perturbations. Inspired by CheckList (Ribeiro et al., 2020), we design a collection of metrics called CheckDST that facilitate comparisons of DST models on comprehensive dimensions of robustness by testing well-known weaknesses with augmented test sets. We evaluate recent DST models with CheckDST and argue that models should be assessed more holistically rather than pursuing state-of-the-art on JGA since a higher JGA does not guarantee better overall robustness. We find that span-based classification models are resilient to unseen named entities but not robust to language variety, whereas those based on autoregressive language models generalize better to language variety but tend to memorize named entities and often hallucinate. Due to their respective weaknesses, neither approach is yet suitable for real-world deployment. We believe CheckDST is a useful guide for future research to develop task-oriented dialogue models that embody the strengths of various methods.
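
下面给出一个玩具级别的示意(数据为虚构),说明 CheckDST 这类稳健性度量的基本形式:在原始测试集与扰动增强测试集上分别计算 JGA,再报告两者之差。

```python
# 示意性草图:对比原始测试集与扰动测试集上的联合目标准确率(JGA)
def joint_goal_accuracy(predictions, references):
    """逐回合比较预测的信念状态与标注是否完全一致。"""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / max(len(references), 1)

def robustness_report(pred_orig, gold_orig, pred_pert, gold_pert):
    jga_orig = joint_goal_accuracy(pred_orig, gold_orig)
    jga_pert = joint_goal_accuracy(pred_pert, gold_pert)
    return {"JGA_original": jga_orig, "JGA_perturbed": jga_pert, "drop": jga_orig - jga_pert}

# 玩具数据:扰动版本把其中一个槽值换成了模型未见过的实体
gold = [{"hotel-name": "acorn guest house"}, {"train-day": "friday"}]
pred_on_original = [{"hotel-name": "acorn guest house"}, {"train-day": "friday"}]
pred_on_perturbed = [{"hotel-name": "acorn guest house"}, {"train-day": "saturday"}]
print(robustness_report(pred_on_original, gold, pred_on_perturbed, gold))
# {'JGA_original': 1.0, 'JGA_perturbed': 0.5, 'drop': 0.5}
```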

【6】 Decomposing Natural Logic Inferences in Neural NLI 标题:神经NLI中自然逻辑推理的分解 链接:https://arxiv.org/abs/2112.08289

作者:Julia Rozanova,Deborah Ferreira,Marco Valentino,Mokanrarangan Thayaparan,Andre Freitas 摘要:为了解释神经NLI模型及其推理策略,我们进行了一项系统的探索性研究,调查这些模型是否捕捉到自然逻辑的关键语义特征:单调性和概念包含性。正确识别向下单调语境中的有效推理是NLI性能的已知绊脚石,包括否定范围和广义量词等语言现象。为了理解这一困难,我们强调单调性是上下文的一个属性,并检查模型在多大程度上捕获了上下文嵌入中的单调性信息,而上下文嵌入是其决策过程的中间环节。根据探测范式的最新进展,我们比较了各种模型中单调性特征的存在。我们发现,在基准测试中获得高分的流行NLI模型的表示中,单调性信息明显较弱,并且观察到,以前基于微调策略对这些模型进行的改进引入了更强的单调性特征,以及它们在挑战集上的改进性能。 摘要:In the interest of interpreting neural NLI models and their reasoning strategies, we carry out a systematic probing study which investigates whether these models capture the crucial semantic features central to natural logic: monotonicity and concept inclusion. Correctly identifying valid inferences in downward-monotone contexts is a known stumbling block for NLI performance, subsuming linguistic phenomena such as negation scope and generalized quantifiers. To understand this difficulty, we emphasize monotonicity as a property of a context and examine the extent to which models capture monotonicity information in the contextual embeddings which are intermediate to their decision making process. Drawing on the recent advancement of the probing paradigm, we compare the presence of monotonicity features across various models. We find that monotonicity information is notably weak in the representations of popular NLI models which achieve high scores on benchmarks, and observe that previous improvements to these models based on fine-tuning strategies have introduced stronger monotonicity features together with their improved performance on challenge sets.
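
摘要中的探测(probing)范式可以用一个线性探针来示意:在模型中间层的上下文嵌入上训练一个简单分类器,看它能否区分向上/向下单调语境。下面的草图使用随机占位数据,仅演示流程,并非论文实验。

```python
# 示意性草图:用逻辑回归探针检测嵌入中是否线性可分地编码了单调性信息
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(400, 768))     # 占位:实际应为 NLI 模型某一中间层的上下文嵌入
labels = rng.integers(0, 2, size=400)        # 占位:0 = 向上单调语境,1 = 向下单调语境

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("探针准确率:", probe.score(X_te, y_te))  # 只有显著高于随机水平,才说明嵌入中含有单调性特征
```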

【7】 Est-ce que vous compute? Code-switching, cultural identity, and AI 标题:Est-ce que vous compute?语码转换、文化认同与人工智能 链接:https://arxiv.org/abs/2112.08256

作者:Arianna Falbo,Travis LaCroix 备注:19 pages. Under Review. Please cite published version, if available 摘要:文化语码转换关系到我们如何调整自己的整体行为、说话方式和外表,以应对社会环境的变化。我们认为有必要研究人工智能系统中的文化代码转换能力。我们探讨了一系列的伦理和认知问题,当文化代码转换对人工智能产生影响时会出现这些问题。基于Dotson(2014)对证明性窒息的分析,我们讨论了人工智能中的新兴技术如何导致认知压迫,特别是一种我们称之为“文化窒息”的自我沉默。如果不解决文化代码转换的社会动态特征,人工智能系统可能会扩大机会差距,进一步巩固社会不平等,从而对已经边缘化的社会群体产生负面影响。 摘要:Cultural code-switching concerns how we adjust our overall behaviours, manners of speaking, and appearance in response to a perceived change in our social environment. We defend the need to investigate cultural code-switching capacities in artificial intelligence systems. We explore a series of ethical and epistemic issues that arise when bringing cultural code-switching to bear on artificial intelligence. Building upon Dotson's (2014) analysis of testimonial smothering, we discuss how emerging technologies in AI can give rise to epistemic oppression, and specifically, a form of self-silencing that we call 'cultural smothering'. By leaving the socio-dynamic features of cultural code-switching unaddressed, AI systems risk negatively impacting already-marginalised social groups by widening opportunity gaps and further entrenching social inequalities.

【8】 Improving Conversational Recommendation Systems' Quality with Context-Aware Item Meta Information 标题:利用上下文感知项元信息提高会话推荐系统的质量 链接:https://arxiv.org/abs/2112.08140

作者:Bowen Yang,Cong Han,Yu Li,Lei Zuo,Zhou Yu 摘要:会话推荐系统(CRS)通过从对话历史中推断用户偏好、提供准确的推荐并生成适当的响应来与用户互动。以前的CRS使用基于知识图(KG)的推荐模块,并将KG与语言模型集成以生成响应。尽管基于KG的方法证明是有效的,但仍有两个问题有待解决。首先,基于KG的方法忽略了会话上下文中的信息,而只依赖实体关系和一袋单词来推荐项目。第二,它需要大量的工程工作来维护建模领域特定关系的KG,从而降低灵活性。在本文中,我们提出了一个简单而有效的体系结构,包括一个预训练语言模型(PLM)和一个项目元数据编码器。编码器学习将项元数据映射到能够反映对话框上下文中语义信息的嵌入。PLM然后使用语义一致的项嵌入和对话框上下文来生成高质量的建议和响应。我们的模型没有使用KG建模实体关系,而是通过将每个项直接转换为嵌入项来降低工程复杂性。在基准数据集ReDial上的实验结果表明,我们的模型在推荐和响应生成任务上都获得了最新的结果。 摘要:Conversational recommendation systems (CRS) engage with users by inferring user preferences from dialog history, providing accurate recommendations, and generating appropriate responses. Previous CRSs use knowledge graph (KG) based recommendation modules and integrate KG with language models for response generation. Although KG-based approaches prove effective, two issues remain to be solved. First, KG-based approaches ignore the information in the conversational context but only rely on entity relations and bag of words to recommend items. Second, it requires substantial engineering efforts to maintain KGs that model domain-specific relations, thus leading to less flexibility. In this paper, we propose a simple yet effective architecture comprising a pre-trained language model (PLM) and an item metadata encoder. The encoder learns to map item metadata to embeddings that can reflect the semantic information in the dialog context. The PLM then consumes the semantic-aligned item embeddings together with dialog context to generate high-quality recommendations and responses. Instead of modeling entity relations with KGs, our model reduces engineering complexity by directly converting each item to an embedding. Experimental results on the benchmark dataset ReDial show that our model obtains state-of-the-art results on both recommendation and response generation tasks.
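
下面是一个极简的示意性草图(结构为假设,并非论文模型):物品元数据编码器把物品文本映射为向量,再与来自预训练语言模型的对话上下文表示做点积,得到推荐得分。

```python
# 示意性草图:物品元数据编码器 + 与对话上下文点积打分
import torch
import torch.nn as nn

class ItemMetadataEncoder(nn.Module):
    def __init__(self, vocab_size: int = 5000, dim: int = 128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim, mode="mean")   # 最简:对元数据 token 取平均
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:      # [num_items, num_tokens]
        return self.proj(self.embed(token_ids))                      # [num_items, dim]

encoder = ItemMetadataEncoder()
item_tokens = torch.randint(0, 5000, (3, 12))    # 3 个候选物品,每个用 12 个元数据 token 表示
dialog_context = torch.randn(1, 128)             # 占位:假设来自预训练语言模型的对话表示
scores = dialog_context @ encoder(item_tokens).T   # [1, 3],每个候选物品的推荐得分
print(scores.softmax(dim=-1))
```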

【9】 Large Dual Encoders Are Generalizable Retrievers 标题:大型双编码器是可推广的检索器 链接:https://arxiv.org/abs/2112.07899

作者:Jianmo Ni,Chen Qu,Jing Lu,Zhuyun Dai,Gustavo Hernández Ábrego,Ji Ma,Vincent Y. Zhao,Yi Luan,Keith B. Hall,Ming-Wei Chang,Yinfei Yang 摘要:研究表明,在一个域上训练的双编码器在检索任务上通常无法推广到其他域。人们普遍认为,双编码器的瓶颈层(最终得分只是查询向量和段落向量之间的点积)过于受限,使双编码器难以成为有效的域外泛化检索模型。在本文中,我们通过在保持瓶颈嵌入维度不变的同时增大双编码器模型的规模,来挑战这一观点。令人惊讶的是,配合多阶段训练,增大模型规模可以在各种检索任务上带来显著提升,尤其是域外泛化。实验结果表明,我们的双编码器,即基于T5的可泛化稠密检索器(Generalizable T5-based dense Retrievers,GTR),在BEIR数据集上显著优于ColBERT以及现有的稀疏和稠密检索器。最令人惊讶的是,我们的消融研究发现GTR的数据效率非常高:只需10%的MS MARCO监督数据即可达到最佳的域外性能。所有GTR模型发布于 https://tfhub.dev/google/collections/gtr/1。 摘要:It has been shown that dual encoders trained on one domain often fail to generalize to other domains for retrieval tasks. One widespread belief is that the bottleneck layer of a dual encoder, where the final score is simply a dot-product between a query vector and a passage vector, is too limited to make dual encoders an effective retrieval model for out-of-domain generalization. In this paper, we challenge this belief by scaling up the size of the dual encoder model while keeping the bottleneck embedding size fixed. With multi-stage training, surprisingly, scaling up the model size brings significant improvement on a variety of retrieval tasks, especially for out-of-domain generalization. Experimental results show that our dual encoders, Generalizable T5-based dense Retrievers (GTR), outperform ColBERT and existing sparse and dense retrievers on the BEIR dataset significantly. Most surprisingly, our ablation study finds that GTR is very data efficient, as it only needs 10% of MS Marco supervised data to achieve the best out-of-domain performance. All the GTR models are released at https://tfhub.dev/google/collections/gtr/1.
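
下面的示意性草图(随机初始化的玩具模型,并非 GTR 实现)演示双编码器检索的两个要点:最终得分是查询向量与段落向量的点积;编码器可以做大,而检索用的瓶颈嵌入维度保持固定。

```python
# 示意性草图:固定瓶颈维度的双编码器,用点积作为检索得分
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, vocab: int = 5000, hidden: int = 512, bottleneck: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # 做大模型只需加深/加宽这里
        self.bottleneck = nn.Linear(hidden, bottleneck)              # 检索向量维度始终固定

    def encode(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden_states = self.encoder(self.embed(token_ids)).mean(dim=1)   # 平均池化
        return F.normalize(self.bottleneck(hidden_states), dim=-1)

model = DualEncoder()
queries = torch.randint(0, 5000, (2, 16))     # 2 条查询
passages = torch.randint(0, 5000, (5, 64))    # 5 个候选段落
scores = model.encode(queries) @ model.encode(passages).T   # [2, 5] 的点积相似度
print(scores.topk(k=2, dim=-1).indices)       # 每条查询得分最高的两个段落下标
```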

【10】 Event Linking: Grounding Event Mentions to Wikipedia 标题:事件链接:将事件提及锚定到维基百科 链接:https://arxiv.org/abs/2112.07888

作者:Xiaodong Yu,Wenpeng Yin,Nitish Gupta,Dan Roth 备注:9 pages, 9 tables, 1 figure 摘要:理解一篇文章需要理解它的组成部分。然而,提及事件的上下文通常缺少该事件的细节。那么,除了上下文之外,我们在哪里可以获得更多关于这个特定事件的知识呢?这项工作定义了事件链接,这是一项新的自然语言理解任务。事件链接尝试将出现在新闻文章中的事件提及链接到最合适的维基百科页面。本页将提供有关事件所指内容的丰富知识。为了规范这一新问题的研究,我们从三个方面做出了贡献。首先,这是社区中第一个正式定义事件链接任务的工作。其次,我们为这个新任务收集一个数据集。具体来说,我们首先从Wikipedia自动收集训练集,然后创建两个评估集:一个来自Wikipedia域,报告域内性能;另一个来自真实世界的新闻域,测试域外性能。第三,我们提出了EveLINK,这是有史以来第一种事件链接方法。总的来说,事件链接是一项相当具有挑战性的任务,需要社区付出更多的努力。此处提供了数据和代码:https://github.com/CogComp/event-linking. 摘要:Comprehending an article requires understanding its constituent events. However, the context where an event is mentioned often lacks the details of this event. Then, where can we obtain more knowledge of this particular event in addition to its context? This work defines Event Linking, a new natural language understanding task at the event level. Event linking tries to link an event mention, appearing in a news article for example, to the most appropriate Wikipedia page. This page is expected to provide rich knowledge about what the event refers to. To standardize the research of this new problem, we contribute in three-fold. First, this is the first work in the community that formally defines event linking task. Second, we collect a dataset for this new task. In specific, we first gather training set automatically from Wikipedia, then create two evaluation sets: one from the Wikipedia domain as well, reporting the in-domain performance; the other from the real-world news domain, testing the out-of-domain performance. Third, we propose EveLINK, the first-ever Event Linking approach. Overall, event linking is a considerably challenging task requiring more effort from the community. Data and code are available here: https://github.com/CogComp/event-linking.
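
作为对任务形式的直观说明,下面用 TF-IDF 余弦相似度对候选维基百科页面排序(纯属示意,EveLINK 的实际方法远比这复杂;示例文本为虚构片段)。

```python
# 示意性草图:按文本相似度为事件提及排序候选维基百科页面
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

mention = "A magnitude 9.0 earthquake struck off the coast of Japan in March 2011."
candidates = {
    "2011 Tohoku earthquake and tsunami": "The 2011 earthquake off the Pacific coast of Tohoku ...",
    "1995 Great Hanshin earthquake": "The Great Hanshin earthquake occurred in January 1995 ...",
    "2011 Japanese Grand Prix": "The 2011 Japanese Grand Prix was a Formula One motor race ...",
}

vec = TfidfVectorizer().fit(list(candidates.values()) + [mention])
scores = cosine_similarity(vec.transform([mention]), vec.transform(list(candidates.values())))[0]
for title, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {title}")     # 相似度最高的页面即为链接候选
```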

【11】 Online anti-Semitism across platforms 标题:跨平台的网上反犹太主义 链接:https://arxiv.org/abs/2112.07783

作者:Tom De Smedt 备注:None 摘要:我们创建了一个用于检测反犹太主义的细粒度人工智能系统。这一可解释的人工智能将在跨平台的在线社交媒体信息中识别英语和德语反犹表达的非人化、言语攻击和阴谋,以支持高层决策。 摘要:We created a fine-grained AI system for the detection of anti-Semitism. This Explainable AI will identify English and German anti-Semitic expressions of dehumanization, verbal aggression and conspiracies in online social media messages across platforms, to support high-level decision making.

【12】 Boosted Dense Retriever 标题:增强式密集检索器 链接:https://arxiv.org/abs/2112.07771

作者:Patrick Lewis,Barlas Oğuz,Wenhan Xiong,Fabio Petroni,Wen-tau Yih,Sebastian Riedel 摘要:我们提出了DrBoost,一个受boosting启发的密集检索集成。DrBoost是分阶段训练的:每个组件模型都是按顺序学习的,并且只关注当前集成所犯的检索错误。最终的表示是所有组件模型输出向量的拼接,使其在测试时可以直接替换标准密集检索器。与标准的密集检索模型相比,DrBoost有几个优点。它生成的表示紧凑4倍,而检索效果与之相当。在粗粒度量化的近似搜索下,它的表现也出人意料地好,可将延迟和带宽需求再降低4倍。在实践中,这往往决定了索引是只能放在磁盘上还是可以放进内存提供服务,从而为成本更低的部署铺平道路。 摘要:We propose DrBoost, a dense retrieval ensemble inspired by boosting. DrBoost is trained in stages: each component model is learned sequentially and specialized by focusing only on retrieval mistakes made by the current ensemble. The final representation is the concatenation of the output vectors of all the component models, making it a drop-in replacement for standard dense retrievers at test time. DrBoost enjoys several advantages compared to standard dense retrieval models. It produces representations which are 4x more compact, while delivering comparable retrieval results. It also performs surprisingly well under approximate search with coarse quantization, reducing latency and bandwidth needs by another 4x. In practice, this can make the difference between serving indices from disk versus from memory, paving the way for much cheaper deployments.
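
下面是一个高度简化的示意性流程(占位实现,并非 DrBoost 原方法):每一阶段只在"当前集成检索出错的查询"上训练新的子模型,最终表示为各子模型向量的拼接。

```python
# 示意性草图:Boosting 式的分阶段训练(train_component 为占位实现)
import numpy as np

def train_component(hard_query_ids, dim: int = 32, seed: int = 0):
    """占位:返回一个随机投影作为子检索器;真实实现应在这些难查询上做对比学习。"""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(300, dim)) / np.sqrt(dim)      # 假设输入为 300 维词袋特征

def encode(bow_features, components):
    """最终表示 = 所有子模型输出向量的拼接(摘要中可直接替换标准检索器的表示)。"""
    return np.concatenate([bow_features @ W for W in components], axis=-1)

def ensemble_errors(query_vecs, passage_vecs, gold_ids):
    retrieved = (query_vecs @ passage_vecs.T).argmax(axis=1)
    return np.flatnonzero(retrieved != gold_ids)            # 当前集成答错的查询下标

rng = np.random.default_rng(1)
q_bow, p_bow = rng.random((20, 300)), rng.random((50, 300))
gold_ids = rng.integers(0, 50, size=20)

components, hard = [], np.arange(20)
for stage in range(3):                                       # 依次训练 3 个子模型
    components.append(train_component(hard, seed=stage))
    hard = ensemble_errors(encode(q_bow, components), encode(p_bow, components), gold_ids)
    print(f"第 {stage + 1} 阶段后,集成仍答错 {len(hard)} 个查询")
```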

【13】 Classifying Emails into Human vs Machine Category 标题:将电子邮件分类为人与机器类别 链接:https://arxiv.org/abs/2112.07742

作者:Changsung Kang,Hongwei Shang,Jean-Marc Langlois 备注:This paper is accepted by AAAI'22 摘要:区分个人和机器生成的电子邮件是Yahoo Mail的基本产品要求。Yahoo Mail中的旧产品分类器基于简单的逻辑回归模型。该模型是通过在SMTP地址级别聚合功能来训练的。我们建议在消息级别构建深度学习模型。我们构建并训练了四个独立的CNN模型:(1)以主题和内容为输入的内容模型;(2) 输入发件人电子邮件地址和姓名的发件人模型;(3) 通过分析电子邮件收件人的操作模式,并根据发件人的打开/删除行为相应地生成目标标签的操作模型;(4) 利用发送者的“显式称呼”信号作为正面标记的称呼模型。接下来,在探索上述四种模型的不同组合后,我们构建了最终的完整模型。编辑数据的实验结果表明,与旧的生产模型相比,我们的完整模型将调整后的召回率从70.5%提高到78.8%,同时将准确率从94.7%提高到96.0%。在这项任务中,我们的完整模型也大大优于最先进的BERT模型。此完整模型已部署到当前的生产系统(Yahoo Mail 6)。 摘要:It is an essential product requirement of Yahoo Mail to distinguish between personal and machine-generated emails. The old production classifier in Yahoo Mail was based on a simple logistic regression model. That model was trained by aggregating features at the SMTP address level. We propose building deep learning models at the message level. We built and trained four individual CNN models: (1) a content model with subject and content as input; (2) a sender model with sender email address and name as input; (3) an action model by analyzing email recipients' action patterns and correspondingly generating target labels based on senders' opening/deleting behaviors; (4) a salutation model by utilizing senders' "explicit salutation" signal as positive labels. Next, we built a final full model after exploring different combinations of the above four models. Experimental results on editorial data show that our full model improves the adjusted-recall from 70.5% to 78.8% compared to the old production model, while at the same time lifts the precision from 94.7% to 96.0%. Our full model also significantly beats the state-of-the-art Bert model at this task. This full model has been deployed into the current production system (Yahoo Mail 6).
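
下面给出一个极简的 1D-CNN 文本分类器草图,示意摘要中"内容模型"这一类子模型的大致结构(词表、维度均为假设,并非 Yahoo Mail 的实际模型);完整模型再把四个子模型的输出加以组合。

```python
# 示意性草图:以主题+正文 token 为输入的 1D-CNN 内容模型(2 类:人工 vs 机器生成)
import torch
import torch.nn as nn

class ContentCNN(nn.Module):
    def __init__(self, vocab: int = 20000, dim: int = 64, n_filters: int = 128,
                 kernel: int = 5, n_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.conv = nn.Conv1d(dim, n_filters, kernel_size=kernel)
        self.out = nn.Linear(n_filters, n_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:   # [batch, seq_len]
        x = self.embed(token_ids).transpose(1, 2)                 # [batch, dim, seq_len]
        x = torch.relu(self.conv(x)).max(dim=-1).values           # 时序维度上的最大池化
        return self.out(x)                                        # [batch, n_classes] 的 logits

model = ContentCNN()
batch = torch.randint(0, 20000, (4, 200))   # 4 封邮件,各取前 200 个(主题+正文)token
print(model(batch).shape)                   # torch.Size([4, 2])
```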

【14】 Representing Inferences and their Lexicalization 标题:表征推理及其词汇化 链接:https://arxiv.org/abs/2112.07711

作者:David McDonald,James Pustejovsky 备注:None 摘要:我们最近启动了一个项目,旨在开发一种更有效、更高效的方法来调用背景知识中的推理,以促进对自然语言的深入理解。一个词的意义被视为它为正在进行的情境所添加的实体、断言、预设和潜在推论。随着词语的组合,情境中的最小模型不断演化,以限制并引导推理。目前,我们已经搭建了计算架构,并在真实文本上实现了它。我们的重点一直是证明这一设计的可行性。 摘要:We have recently begun a project to develop a more effective and efficient way to marshal inferences from background knowledge to facilitate deep natural language understanding. The meaning of a word is taken to be the entities, predications, presuppositions, and potential inferences that it adds to an ongoing situation. As words compose, the minimal model in the situation evolves to limit and direct inference. At this point we have developed our computational architecture and implemented it on real text. Our focus has been on proving the feasibility of our design.
