cs.CL: 28 papers today
Transformer (1 paper)
【1】 Biomedical Data-to-Text Generation via Fine-Tuning Transformers
Link: https://arxiv.org/abs/2109.01518
Authors: Ruslan Yermakov, Nicholas Drago, Angelo Ziletti
Affiliations: Decision Science & Advanced Analytics, Bayer AG; Regulatory Policy and Intelligence
Note: Accepted at INLG 2021 (International Conference on Natural Language Generation, organised by the Association for Computational Linguistics)
Abstract: Data-to-text (D2T) generation in the biomedical domain is a promising - yet mostly unexplored - field of research. Here, we apply neural models for D2T generation to a real-world dataset consisting of package leaflets of European medicines. We show that fine-tuned transformers are able to generate realistic, multi-sentence text from data in the biomedical domain, yet have important limitations. We also release a new dataset (BioLeaflets) for benchmarking D2T generation models in the biomedical domain.
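The paper's core recipe is standard seq2seq fine-tuning on linearized records. A minimal sketch with Hugging Face Transformers is given below; the choice of t5-small and the field-linearization format are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch: fine-tuning a seq2seq transformer for data-to-text.
# The "name: value | ..." linearization is an assumption, not necessarily
# the BioLeaflets format used in the paper.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# One (data, text) pair: structured fields linearized into a flat string.
source = "drug: Exampletol | form: tablet | dose: 50 mg"  # hypothetical record
target = "Exampletol is supplied as a 50 mg tablet."

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

model.train()
loss = model(**inputs, labels=labels).loss  # teacher-forced cross-entropy
loss.backward()
optimizer.step()

# Generation after fine-tuning: beam search over the (multi-sentence) output.
model.eval()
ids = model.generate(**inputs, num_beams=4, max_length=64)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```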
BERT (1 paper)
【1】 A Context-Aware Hierarchical BERT Fusion Network for Multi-turn Dialog Act Detection
Link: https://arxiv.org/abs/2109.01267
Authors: Ting-Wei Wu, Ruolin Su, Biing-Hwang Juang
Affiliations: Georgia Institute of Technology
Note: Published at Interspeech 2021
Abstract: The success of interactive dialog systems is usually associated with the quality of the spoken language understanding (SLU) task, which mainly identifies the corresponding dialog acts and slot values in each turn. By treating utterances in isolation, most SLU systems often overlook the semantic context in which a dialog act is expected. The act dependency between turns is non-trivial and yet critical to the identification of the correct semantic representations. Previous works with limited context awareness have exposed the inadequacy of dealing with complexity in multi-pronged user intents, which are subject to spontaneous change during turn transitions. In this work, we propose to enhance SLU in multi-turn dialogs, employing a context-aware hierarchical BERT fusion network (CaBERT-SLU) to not only discern context information within a dialog but also jointly identify multiple dialog acts and slots in each utterance. Experimental results show that our approach reaches new state-of-the-art (SOTA) performances in two complicated multi-turn dialogue datasets with considerable improvements compared with previous methods, which only consider single utterances for multiple intents and slot filling.
QA|VQA|Question Answering|Dialogue (1 paper)
【1】 Challenges in Generalization in Open Domain Question Answering
Link: https://arxiv.org/abs/2109.01156
Authors: Linqing Liu, Patrick Lewis, Sebastian Riedel, Pontus Stenetorp
Affiliations: University College London; Facebook AI Research
Abstract: Recent work on Open Domain Question Answering has shown that there is a large discrepancy in model performance between novel test questions and those that largely overlap with training questions. However, it is as of yet unclear which aspects of novel questions make them challenging. Drawing upon studies on systematic generalization, we introduce and annotate questions according to three categories that measure different levels and kinds of generalization: training set overlap, compositional generalization (comp-gen), and novel entity generalization (novel-entity). When evaluating six popular parametric and non-parametric models, we find that for the established Natural Questions and TriviaQA datasets, even the strongest model performance for comp-gen/novel-entity is 13.1/5.4% and 9.6/1.5% lower compared to that for the full test set -- indicating the challenge posed by these types of questions. Furthermore, we show that whilst non-parametric models can handle questions containing novel entities, they struggle with those requiring compositional generalization. Through thorough analysis we find that key question difficulty factors are: cascading errors from the retrieval component, frequency of question pattern, and frequency of the entity.
Machine Translation (1 paper)
【1】 Language Modeling, Lexical Translation, Reordering: The Training Process of NMT through the Lens of Classical SMT
Link: https://arxiv.org/abs/2109.01396
Authors: Elena Voita, Rico Sennrich, Ivan Titov
Affiliations: University of Edinburgh, Scotland; University of Amsterdam, Netherlands; University of Zurich, Switzerland
Note: EMNLP 2021
Abstract: Differently from the traditional statistical MT that decomposes the translation task into distinct separately learned components, neural machine translation uses a single neural network to model the entire translation process. Despite neural machine translation being the de-facto standard, it is still not clear how NMT models acquire different competences over the course of training, and how this mirrors the different models in traditional SMT. In this work, we look at the competences related to three core SMT components and find that during training, NMT first focuses on learning target-side language modeling, then improves translation quality approaching word-by-word translation, and finally learns more complicated reordering patterns. We show that this behavior holds for several models and language pairs. Additionally, we explain how such an understanding of the training process can be useful in practice and, as an example, show how it can be used to improve vanilla non-autoregressive neural machine translation by guiding teacher model selection.
Graph|Knowledge Graph|Knowledge (2 papers)
【1】 CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge
Link: https://arxiv.org/abs/2109.01653
Authors: Yasumasa Onoe, Michael J. Q. Zhang, Eunsol Choi, Greg Durrett
Affiliations: The University of Texas at Austin
Abstract: Most benchmark datasets targeting commonsense reasoning focus on everyday scenarios: physical knowledge like knowing that you could fill a cup under a waterfall [Talmor et al., 2019], social knowledge like bumping into someone is awkward [Sap et al., 2019], and other generic situations. However, there is a rich space of commonsense inferences anchored to knowledge about specific entities: for example, deciding the truthfulness of a claim "Harry Potter can teach classes on how to fly on a broomstick." Can models learn to combine entity knowledge with commonsense reasoning in this fashion? We introduce CREAK, a testbed for commonsense reasoning about entity knowledge, bridging fact-checking about entities (Harry Potter is a wizard and is skilled at riding a broomstick) with commonsense inferences (if you're good at a skill you can teach others how to do it). Our dataset consists of 13k human-authored English claims about entities that are either true or false, in addition to a small contrast set. Crowdworkers can easily come up with these statements and human performance on the dataset is high (high 90s); we argue that models should be able to blend entity knowledge and commonsense reasoning to do well here. In our experiments, we focus on the closed-book setting and observe that a baseline model finetuned on an existing fact verification benchmark struggles on CREAK. Training a model on CREAK improves accuracy by a substantial margin, but still falls short of human performance. Our benchmark provides a unique probe into natural language understanding models, testing both their ability to retrieve facts (e.g., who teaches at the University of Chicago?) and unstated commonsense knowledge (e.g., butlers do not yell at guests).
【2】 LG4AV: Combining Language Models and Graph Neural Networks for Author Verification
Link: https://arxiv.org/abs/2109.01479
Authors: Maximilian Stubbemann, Gerd Stumme
Affiliations: L3S Research Center and University of Kassel, Kassel, Germany
Note: 9 pages, 1 figure
Abstract: The automatic verification of document authorships is important in various settings. Researchers are for example judged and compared by the amount and impact of their publications and public figures are confronted by their posts on social media platforms. Therefore, it is important that authorship information in frequently used web services and platforms is correct. The question whether a given document is written by a given author is commonly referred to as authorship verification (AV). While AV is a widely investigated problem in general, only few works consider settings where the documents are short and written in a rather uniform style. This makes most approaches impractical for online databases and knowledge graphs in the scholarly domain. Here, authorships of scientific publications have to be verified, often with just abstracts and titles available. To this point, we present our novel approach LG4AV which combines language models and graph neural networks for authorship verification. By directly feeding the available texts in a pre-trained transformer architecture, our model does not need any hand-crafted stylometric features that are not meaningful in scenarios where the writing style is, at least to some extent, standardized. By the incorporation of a graph neural network structure, our model can benefit from relations between authors that are meaningful with respect to the verification process. For example, scientific authors are more likely to write about topics that are addressed by their co-authors and Twitter users tend to post about the same subjects as people they follow. We experimentally evaluate our model and study to which extent the inclusion of co-authorships enhances verification decisions in bibliometric environments.
Reasoning|Analysis|Understanding|Interpretation (1 paper)
【1】 Learning from Multiple Noisy Augmented Data Sets for Better Cross-Lingual Spoken Language Understanding
Link: https://arxiv.org/abs/2109.01583
Authors: Yingmei Guo, Linjun Shou, Jian Pei, Ming Gong, Mingxing Xu, Zhiyong Wu, Daxin Jiang
Affiliations: Department of Computer Science and Technology, Tsinghua University; NLP Group, Microsoft STCA; School of Computing Science, Simon Fraser University
Note: Long paper at EMNLP 2021
Abstract: Lack of training data presents a grand challenge to scaling out spoken language understanding (SLU) to low-resource languages. Although various data augmentation approaches have been proposed to synthesize training data in low-resource target languages, the augmented data sets are often noisy, and thus impede the performance of SLU models. In this paper we focus on mitigating noise in augmented data. We develop a denoising training approach. Multiple models are trained with data produced by various augmentation methods. Those models provide supervision signals to each other. The experimental results show that our method outperforms the existing state of the art by 3.05 and 4.24 percentage points on two benchmark datasets, respectively. The code will be open-sourced on GitHub.
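The mutual-supervision idea lends itself to a compact sketch: models trained on different augmented sets regularize each other. The symmetric-KL co-regularization below is an assumed instantiation of the paper's "supervision signals to each other", not its exact loss.

```python
# Sketch of mutual supervision between models trained on different
# noisy augmented sets (assumed symmetric-KL instantiation).
import torch
import torch.nn.functional as F

def make_model(num_feats=64, num_labels=5):
    return torch.nn.Linear(num_feats, num_labels)  # stand-in SLU classifier

model_a, model_b = make_model(), make_model()
opt = torch.optim.Adam(
    list(model_a.parameters()) + list(model_b.parameters()), lr=1e-3)

# x_a / x_b: batches from two different augmentation methods; y: noisy labels.
x_a, x_b = torch.randn(8, 64), torch.randn(8, 64)
y = torch.randint(0, 5, (8,))

logits_a, logits_b = model_a(x_a), model_b(x_b)
ce = F.cross_entropy(logits_a, y) + F.cross_entropy(logits_b, y)

# Each model's (detached) distribution supervises the other, damping
# noise that only one augmentation method introduces.
p_a, p_b = F.softmax(logits_a, -1), F.softmax(logits_b, -1)
mutual = (F.kl_div(F.log_softmax(logits_a, -1), p_b.detach(), reduction="batchmean")
          + F.kl_div(F.log_softmax(logits_b, -1), p_a.detach(), reduction="batchmean"))

(ce + 0.5 * mutual).backward()
opt.step()
```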
GAN|Adversarial|Attack|Generation (2 papers)
【1】 Contrastive Representation Learning for Exemplar-Guided Paraphrase Generation
Link: https://arxiv.org/abs/2109.01484
Authors: Haoran Yang, Wai Lam, Piji Li
Affiliations: The Chinese University of Hong Kong; Tencent AI Lab
Note: Findings of EMNLP 2021
Abstract: Exemplar-Guided Paraphrase Generation (EGPG) aims to generate a target sentence which conforms to the style of the given exemplar while encapsulating the content information of the source sentence. In this paper, we propose a new method with the goal of learning a better representation of the style and the content. This method is mainly motivated by the recent success of contrastive learning which has demonstrated its power in unsupervised feature extraction tasks. The idea is to design two contrastive losses with respect to the content and the style by considering two problem characteristics during training. One characteristic is that the target sentence shares the same content with the source sentence, and the second characteristic is that the target sentence shares the same style with the exemplar. These two contrastive losses are incorporated into the general encoder-decoder paradigm. Experiments on two datasets, namely QQP-Pos and ParaNMT, demonstrate the effectiveness of our proposed contrastive losses.
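Both losses fit the standard contrastive template: pull the target toward its positive (the source in content space, the exemplar in style space) against in-batch negatives. A sketch assuming an InfoNCE formulation, which the abstract does not spell out:

```python
# Sketch of the two contrastive losses, assuming an InfoNCE form with
# in-batch negatives; representations here are random placeholders.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """anchor, positive: (batch, dim); non-matching rows act as negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature      # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0))      # diagonal = matching pair
    return F.cross_entropy(logits, labels)

batch, dim = 16, 256
content_tgt, content_src = torch.randn(batch, dim), torch.randn(batch, dim)
style_tgt, style_exemplar = torch.randn(batch, dim), torch.randn(batch, dim)

loss = info_nce(content_tgt, content_src) + info_nce(style_tgt, style_exemplar)
# `loss` is then added to the usual encoder-decoder generation objective.
```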
【2】 Multimodal Conditionality for Natural Language Generation
Link: https://arxiv.org/abs/2109.01229
Authors: Michael Sollami, Aashish Jain
Abstract: Large scale pretrained language models have demonstrated state-of-the-art performance in language understanding tasks. Their application has recently expanded into multimodal learning, leading to improved representations combining vision and language. However, progress in adapting language models towards conditional Natural Language Generation (NLG) has been limited to a single modality, generally text. We propose MAnTiS, Multimodal Adaptation for Text Synthesis, a general approach for multimodal conditionality in transformer-based NLG models. In this method, we pass inputs from each modality through modality-specific encoders, project to textual token space, and finally join to form a conditionality prefix. We fine-tune the pretrained language model and encoders with the conditionality prefix guiding the generation. We apply MAnTiS to the task of product description generation, conditioning a network on both product images and titles to generate descriptive text. We demonstrate that MAnTiS outperforms strong baseline approaches on standard NLG scoring metrics. Furthermore, qualitative assessments demonstrate that MAnTiS can generate human-quality descriptions consistent with given multimodal inputs.
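The conditionality-prefix mechanism is concrete enough to sketch: modality-specific encoders are projected into the token-embedding space and prepended to the text embeddings of a decoder-only LM. The choice of GPT-2 and the 2048-d pooled image feature below are illustrative assumptions.

```python
# Sketch of a conditionality prefix for a decoder-only LM.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
d_model = lm.config.n_embd                      # 768 for gpt2

image_feats = torch.randn(1, 2048)              # e.g. pooled CNN features
img_proj = torch.nn.Linear(2048, d_model)       # modality-specific projection

title_ids = tok("Trail running shoe", return_tensors="pt").input_ids
title_emb = lm.transformer.wte(title_ids)       # title tokens as embeddings

# Prefix = projected image vector + title embeddings.
prefix = torch.cat([img_proj(image_feats).unsqueeze(1), title_emb], dim=1)

desc_ids = tok("Lightweight shoe with a grippy outsole.",
               return_tensors="pt").input_ids
desc_emb = lm.transformer.wte(desc_ids)

inputs_embeds = torch.cat([prefix, desc_emb], dim=1)
# Loss only on description tokens; prefix positions are ignored (-100).
labels = torch.cat(
    [torch.full(prefix.shape[:2], -100, dtype=torch.long), desc_ids], dim=1)
loss = lm(inputs_embeds=inputs_embeds, labels=labels).loss
```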
Semi/Weakly/Unsupervised Learning|Uncertainty (1 paper)
【1】 Entity Linking and Discovery via Arborescence-based Supervised Clustering
Link: https://arxiv.org/abs/2109.01242
Authors: Dhruv Agarwal, Rico Angell, Nicholas Monath, Andrew McCallum
Affiliations: College of Information and Computer Sciences, University of Massachusetts Amherst
Abstract: Previous work has shown promising results in performing entity linking by measuring not only the affinities between mentions and entities but also those amongst mentions. In this paper, we present novel training and inference procedures that fully utilize mention-to-mention affinities by building minimum arborescences (i.e., directed spanning trees) over mentions and entities across documents in order to make linking decisions. We also show that this method gracefully extends to entity discovery, enabling the clustering of mentions that do not have an associated entity in the knowledge base. We evaluate our approach on the Zero-Shot Entity Linking dataset and MedMentions, the largest publicly available biomedical dataset, and show significant improvements in performance for both entity linking and discovery compared to identically parameterized models. We further show significant efficiency improvements with only a small loss in accuracy over previous work, which uses more computationally expensive models.
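The inference step builds a minimum arborescence over entities and mentions with affinity-weighted edges; a toy sketch with networkx's Edmonds implementation is below. The synthetic ROOT node and the scores are assumptions for illustration.

```python
# Sketch of arborescence-based linking: edges carry negated affinities,
# and the minimum arborescence implies the linking/clustering decisions.
import networkx as nx

G = nx.DiGraph()
# A synthetic ROOT over candidate entities (assumed construction).
G.add_edge("ROOT", "ENT:Paris", weight=0.0)
G.add_edge("ROOT", "ENT:Texas", weight=0.0)
# Entity->mention and mention->mention edges, weight = -affinity.
G.add_edge("ENT:Paris", "m1", weight=-0.9)
G.add_edge("ENT:Paris", "m2", weight=-0.2)
G.add_edge("m1", "m2", weight=-0.8)   # strong mention-to-mention affinity
G.add_edge("m1", "m3", weight=-0.7)
G.add_edge("ENT:Texas", "m3", weight=-0.1)

arb = nx.minimum_spanning_arborescence(G)  # Edmonds' algorithm
print(sorted(arb.edges()))
# m3 attaches under m1 (and hence ENT:Paris) despite its direct ENT:Texas
# edge -- exactly the benefit of using mention-to-mention affinities.
```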
Detection (1 paper)
【1】 Detecting Speaker Personas from Conversational Texts
Link: https://arxiv.org/abs/2109.01330
Authors: Jia-Chen Gu, Zhen-Hua Ling, Yu Wu, Quan Liu, Zhigang Chen, Xiaodan Zhu
Affiliations: National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China; Microsoft Research Asia, Beijing, China; State Key Laboratory of Cognitive Intelligence, iFLYTEK Research, Hefei, China
Note: Accepted by EMNLP 2021
Abstract: Personas are useful for dialogue response prediction. However, the personas used in current studies are pre-defined and hard to obtain before a conversation. To tackle this issue, we study a new task, named Speaker Persona Detection (SPD), which aims to detect speaker personas based on the plain conversational text. In this task, a best-matched persona is searched out from candidates given the conversational text. This is a many-to-many semantic matching task because both contexts and personas in SPD are composed of multiple sentences. The long-term dependency and the dynamic redundancy among these sentences increase the difficulty of this task. We build a dataset for SPD, dubbed as Persona Match on Persona-Chat (PMPC). Furthermore, we evaluate several baseline models and propose utterance-to-profile (U2P) matching networks for this task. The U2P models operate at a fine granularity, treating both contexts and personas as sets of multiple sequences. Then, each sequence pair is scored and an interpretable overall score is obtained for a context-persona pair through aggregation. Evaluation results show that the U2P models outperform their baseline counterparts significantly.
Recognition/Classification (4 papers)
【1】 Empirical Study of Named Entity Recognition Performance Using Distribution-aware Word Embedding
Link: https://arxiv.org/abs/2109.01636
Authors: Xin Chen, Qi Zhao, Xinyang Liu
Affiliations: ISE Department, University of Illinois at Urbana-Champaign
Abstract: With the fast development of Deep Learning techniques, Named Entity Recognition (NER) is becoming more and more important in the information extraction task. The greatest difficulty that the NER task faces is to keep the detectability even when the types of named entities and documents are unfamiliar. Realizing that the specificity information may contain potential meanings of a word and generate semantic-related features for word embedding, we develop a distribution-aware word embedding and implement three different methods to make use of the distribution information in an NER framework. The results show that NER performance improves when word specificity is incorporated into existing NER methods.
【2】 Contextualized Embeddings based Convolutional Neural Networks for Duplicate Question Identification
Link: https://arxiv.org/abs/2109.01560
Authors: Harsh Sakhrani, Saloni Parekh, Pratik Ratadiya
Affiliations: Pune Institute of Computer Technology, Maharashtra, India; CreaTek Consulting Services Pvt. Ltd., Maharashtra, India
Note: Accepted at the 30th International Joint Conference on Artificial Intelligence (IJCAI-ASEA) 2021: Workshop on Applied Semantics Extraction and Analytics; 7 pages, 3 figures
Abstract: Question Paraphrase Identification (QPI) is a critical task for large-scale Question-Answering forums. The purpose of QPI is to determine whether a given pair of questions are semantically identical or not. Previous approaches for this task have yielded promising results, but have often relied on complex recurrence mechanisms that are expensive and time-consuming in nature. In this paper, we propose a novel architecture combining a Bidirectional Transformer Encoder with Convolutional Neural Networks for the QPI task. We produce the predictions from the proposed architecture using two different inference setups: Siamese and Matched-Aggregation. Experimental results demonstrate that our model achieves state-of-the-art performance on the Quora Question Pairs dataset. We empirically prove that the addition of convolution layers to the model architecture improves the results in both inference setups. We also investigate the impact of partial and complete fine-tuning and analyze the trade-off between computational power and accuracy in the process. Based on the obtained results, we conclude that the Matched-Aggregation setup consistently outperforms the Siamese setup. Our work provides insights into what architecture combinations and setups are likely to produce better results for the QPI task.
【3】 An Open-Source Dataset and A Multi-Task Model for Malay Named Entity Recognition
Link: https://arxiv.org/abs/2109.01293
Authors: Yingwen Fu, Nankai Lin, Zhihe Yang, Shengyi Jiang
Affiliations: School of Information Science and Technology, Guangdong University of Foreign Studies, Guangdong, China; Guangzhou Key Laboratory of Multilingual Intelligent Processing, Guangdong University of Foreign Studies, Guangzhou, China
Abstract: Named entity recognition (NER) is a fundamental task of natural language processing (NLP). However, most state-of-the-art research is mainly oriented to high-resource languages such as English and has not been widely applied to low-resource languages. In Malay language, relevant NER resources are limited. In this work, we propose a dataset construction framework, which is based on labeled datasets of homologous languages and iterative optimization, to build a Malay NER dataset (MYNER) comprising 28,991 sentences (over 384 thousand tokens). Additionally, to better integrate boundary information for NER, we propose a multi-task (MT) model with a bidirectional revision (Bi-revision) mechanism for the Malay NER task. Specifically, an auxiliary task, boundary detection, is introduced to improve NER training in both explicit and implicit ways. Furthermore, a gated ignoring mechanism is proposed to conduct conditional label transfer and alleviate error propagation by the auxiliary task. Experimental results demonstrate that our model achieves comparable results over baselines on MYNER. The dataset and the model in this paper will be publicly released as a benchmark dataset.
【4】 Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition
Link: https://arxiv.org/abs/2109.01163
Authors: Maxime Burchi, Valentin Vielzeuf
Affiliations: Orange Labs, Cesson-Sévigné, France
Abstract: The recently proposed Conformer architecture has shown state-of-the-art performances in Automatic Speech Recognition by combining convolution with attention to model both local and global dependencies. In this paper, we study how to reduce the Conformer architecture complexity with a limited computing budget, leading to a more efficient architecture design that we call Efficient Conformer. We introduce progressive downsampling to the Conformer encoder and propose a novel attention mechanism named grouped attention, allowing us to reduce attention complexity from $O(n^{2}d)$ to $O(n^{2}d / g)$ for sequence length $n$, hidden dimension $d$ and group size parameter $g$. We also experiment with the use of strided multi-head self-attention as a global downsampling operation. Our experiments are performed on the LibriSpeech dataset with CTC and RNN-Transducer losses. We show that within the same computing budget, the proposed architecture achieves better performances with faster training and decoding compared to the Conformer. Our 13M parameters CTC model achieves competitive WERs of 3.6%/9.0% without using a language model and 2.7%/6.7% with an external n-gram language model on the test-clean/test-other sets while being 29% faster than our CTC Conformer baseline at inference and 36% faster to train.
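The stated complexity reduction corresponds to folding g neighbouring frames into the feature dimension before computing attention scores, shrinking the score matrix from n x n to n/g x n/g. A shape-level sketch (single head, no projections or masking, which the real layer of course adds):

```python
# Grouped attention sketch: attention on length n/g with dimension g*d
# costs O((n/g)^2 * g*d) = O(n^2 d / g), matching the paper's analysis.
import torch
import torch.nn.functional as F

def grouped_self_attention(x, g):
    b, n, d = x.shape                    # assumes n divisible by g
    xg = x.reshape(b, n // g, g * d)     # fold g frames into the feature dim
    scores = xg @ xg.transpose(1, 2) / (g * d) ** 0.5   # (b, n/g, n/g)
    out = F.softmax(scores, dim=-1) @ xg                # (b, n/g, g*d)
    return out.reshape(b, n, d)          # unfold back to the frame rate

x = torch.randn(2, 128, 64)
print(grouped_self_attention(x, g=4).shape)  # torch.Size([2, 128, 64])
```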
Zero/Few/One-Shot|Transfer|Adaptation (2 papers)
【1】 Finetuned Language Models Are Zero-Shot Learners
Link: https://arxiv.org/abs/2109.01652
Authors: Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le
Affiliations: Google Research
Abstract: This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially boosts zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 19 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of tasks and model scale are key components to the success of instruction tuning.
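Instruction tuning hinges on "verbalizing" existing datasets through natural-language templates. The sketch below shows the mechanics for an NLI example; the template wording is illustrative, not FLAN's actual template set.

```python
# Sketch of dataset verbalization: one labeled example becomes several
# (instruction, target) pairs; mixing many tasks this way is what the
# finetuning stage trains on. Templates are illustrative assumptions.
nli_templates = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? {options}",
    "{premise}\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"? {options}",
    "Read this: {premise}\nDoes it follow that \"{hypothesis}\"? {options}",
]

example = {
    "premise": "A cat is sleeping on the sofa.",
    "hypothesis": "An animal is resting.",
    "options": "OPTIONS: yes, no",
    "answer": "yes",
}

for t in nli_templates:
    print(t.format(**example), "->", example["answer"])
```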
【2】 Information Symmetry Matters: A Modal-Alternating Propagation Network for Few-Shot Learning
Link: https://arxiv.org/abs/2109.01295
Authors: Zhong Ji, Zhishen Hou, Xiyao Liu, Yanwei Pang, Jungong Han
Abstract: Semantic information provides intra-class consistency and inter-class discriminability beyond visual concepts, which has been employed in Few-Shot Learning (FSL) to achieve further gains. However, semantic information is only available for labeled samples but absent for unlabeled samples, in which the embeddings are rectified unilaterally by guiding the few labeled samples with semantics. Therefore, it is inevitable to bring a cross-modal bias between semantic-guided samples and nonsemantic-guided samples, which results in an information asymmetry problem. To address this problem, we propose a Modal-Alternating Propagation Network (MAP-Net) to supplement the absent semantic information of unlabeled samples, which builds information symmetry among all samples in both visual and semantic modalities. Specifically, the MAP-Net transfers the neighbor information by the graph propagation to generate the pseudo-semantics for unlabeled samples guided by the completed visual relationships and rectify the feature embeddings. In addition, due to the large discrepancy between visual and semantic modalities, we design a Relation Guidance (RG) strategy to guide the visual relation vectors via semantics so that the propagated information is more beneficial. Extensive experimental results on three semantic-labeled datasets, i.e., Caltech-UCSD-Birds 200-2011, SUN Attribute Database, and Oxford 102 Flower, have demonstrated that our proposed method achieves promising performance and outperforms the state-of-the-art approaches, which indicates the necessity of information symmetry.
Retrieval (1 paper)
【1】 Cross-Lingual Training with Dense Retrieval for Document Retrieval
Link: https://arxiv.org/abs/2109.01628
Authors: Peng Shi, Rui Zhang, He Bai, Jimmy Lin
Affiliations: David R. Cheriton School of Computer Science, University of Waterloo; Department of Computer Science and Engineering, Penn State University
Abstract: Dense retrieval has shown great success in passage ranking in English. However, its effectiveness in document retrieval for non-English languages remains unexplored due to the limitation in training resources. In this work, we explore different transfer techniques for document ranking from English annotations to multiple non-English languages. Our experiments on the test collections in six languages (Chinese, Arabic, French, Hindi, Bengali, Spanish) from diverse language families reveal that zero-shot model-based transfer using mBERT improves the search quality in non-English monolingual retrieval. Also, we find that weakly-supervised target language transfer yields competitive performances against the generation-based target language transfer that requires external translators and query generators.
Word2Vec|Text|Words (1 paper)
【1】 An Empirical Study on Leveraging Position Embeddings for Target-oriented Opinion Words Extraction
Link: https://arxiv.org/abs/2109.01238
Authors: Samuel Mensah, Kai Sun, Nikolaos Aletras
Affiliations: Computer Science Department, University of Sheffield, UK; BDBC and SKLSDE, Beihang University, China
Note: Accepted at EMNLP 2021
Abstract: Target-oriented opinion words extraction (TOWE) (Fan et al., 2019b) is a new subtask of target-oriented sentiment analysis that aims to extract opinion words for a given aspect in text. Current state-of-the-art methods leverage position embeddings to capture the relative position of a word to the target. However, the performance of these methods depends on the ability to incorporate this information into word representations. In this paper, we explore a variety of text encoders based on pretrained word embeddings or language models that leverage part-of-speech and position embeddings, aiming to examine the actual contribution of each component in TOWE. We also adapt a graph convolutional network (GCN) to enhance word representations by incorporating syntactic information. Our experimental results demonstrate that BiLSTM-based models can effectively encode position information into word representations while using a GCN only achieves marginal gains. Interestingly, our simple methods outperform several state-of-the-art complex neural structures.
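The finding that BiLSTMs encode target position well is easy to picture: each token gets an embedding of its clipped, signed distance to the target term alongside its word embedding. A sketch, with sizes and the concatenation scheme as assumptions:

```python
# Sketch of a BiLSTM tagger with relative-position embeddings for TOWE.
import torch
import torch.nn as nn

vocab, max_dist, n_tags = 1000, 50, 3          # tags: B/I/O opinion spans
word_emb = nn.Embedding(vocab, 100)
pos_emb = nn.Embedding(2 * max_dist + 1, 25)   # clipped signed distance
encoder = nn.LSTM(125, 64, bidirectional=True, batch_first=True)
tagger = nn.Linear(128, n_tags)

tokens = torch.randint(0, vocab, (1, 8))
target_idx = 3                                 # position of the aspect term
dist = torch.arange(8).unsqueeze(0) - target_idx
dist = dist.clamp(-max_dist, max_dist) + max_dist  # shift indices to >= 0

h, _ = encoder(torch.cat([word_emb(tokens), pos_emb(dist)], dim=-1))
print(tagger(h).shape)                         # (1, 8, n_tags) per-token logits
```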
Other Neural Networks|Deep Learning|Models|Modeling (4 papers)
【1】 Learning Neural Models for Natural Language Processing in the Face of Distributional Shift
Link: https://arxiv.org/abs/2109.01558
Authors: Paul Michel
Affiliations: Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. Thesis committee: Graham Neubig (chair), Zico Kolter, Zachary Lipton, Tatsunori Hashimoto (Stanford University)
Note: PhD thesis
Abstract: The dominating NLP paradigm of training a strong neural predictor to perform one task on a specific dataset has led to state-of-the-art performance in a variety of applications (e.g. sentiment classification, span-prediction based question answering or machine translation). However, it builds upon the assumption that the data distribution is stationary, i.e. that the data is sampled from a fixed distribution both at training and test time. This way of training is inconsistent with how we as humans are able to learn from and operate within a constantly changing stream of information. Moreover, it is ill-adapted to real-world use cases where the data distribution is expected to shift over the course of a model's lifetime. The first goal of this thesis is to characterize the different forms this shift can take in the context of natural language processing, and propose benchmarks and evaluation metrics to measure its effect on current deep learning architectures. We then proceed to take steps to mitigate the effect of distributional shift on NLP models. To this end, we develop methods based on parametric reformulations of the distributionally robust optimization framework. Empirically, we demonstrate that these approaches yield more robust models as demonstrated on a selection of realistic problems. In the third and final part of this thesis, we explore ways of efficiently adapting existing models to new domains or tasks. Our contribution to this topic takes inspiration from information geometry to derive a new gradient update rule which alleviates catastrophic forgetting issues during adaptation.
【2】 Do Prompt-Based Models Really Understand the Meaning of their Prompts?
Link: https://arxiv.org/abs/2109.01247
Authors: Albert Webson, Ellie Pavlick
Affiliations: Department of Computer Science, Brown University; Department of Philosophy, Brown University
Note: Code available (URL in the arXiv listing)
Abstract: Recently, a boom of papers have shown extraordinary progress in few-shot learning with various prompt-based models. Such success can give the impression that prompts help models to learn faster in the same way that humans learn faster when provided with task instructions expressed in natural language. In this study, we experiment with over 30 prompts manually written for natural language inference (NLI). We find that models learn just as fast with many prompts that are intentionally irrelevant or even pathologically misleading as they do with instructively "good" prompts. Additionally, we find that model performance is more dependent on the choice of the LM target words (a.k.a. the "verbalizer" that converts LM vocabulary prediction to class labels) than on the text of the prompt itself. In sum, we find little evidence that suggests existing prompt-based models truly understand the meaning of their given prompts.
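The "verbalizer" the paper highlights is the mapping from LM vocabulary items to class labels. A minimal masked-LM sketch follows; the prompt pattern and target words are illustrative choices, not the paper's exact set.

```python
# Sketch of a verbalizer: class scores are read off the masked-LM
# distribution at the [MASK] position for chosen target words.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

premise = "A cat is sleeping on the sofa."
hypothesis = "An animal is resting."
text = f"{premise} ? [MASK] , {hypothesis}"   # a common NLI prompt pattern

enc = tok(text, return_tensors="pt")
mask_pos = (enc.input_ids[0] == tok.mask_token_id).nonzero().item()
logits = mlm(**enc).logits[0, mask_pos]

# The verbalizer maps LM vocabulary words to class labels.
verbalizer = {"entailment": "yes", "contradiction": "no", "neutral": "maybe"}
for label, word in verbalizer.items():
    tid = tok.convert_tokens_to_ids(word)
    print(label, logits[tid].item())
# The paper's finding: swapping these target words often matters more
# than rewording the rest of the prompt.
```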
【3】 Establishing Interlingua in Multilingual Language Models
Link: https://arxiv.org/abs/2109.01207
Authors: Maksym Del, Mark Fishel
Affiliations: Institute of Computer Science, University of Tartu, Estonia
Note: 8 pages, 10 figures
Abstract: Large multilingual language models show remarkable zero-shot cross-lingual transfer performance on a range of tasks. Follow-up works hypothesized that these models internally project representations of different languages into a shared interlingual space. However, they produced contradictory results. In this paper, we correct the famous prior work claiming that "BERT is not an Interlingua" and show that with the proper choice of sentence representation different languages actually do converge to a shared space in such language models. Furthermore, we demonstrate that this convergence pattern is robust across four measures of correlation similarity and six mBERT-like models. We then extend our analysis to 28 diverse languages and find that the interlingual space exhibits a particular structure similar to the linguistic relatedness of languages. We also highlight a few outlier languages that seem to fail to converge to the shared space. The code for replicating our results is available at the following URL: https://github.com/maksym-del/interlingua.
【4】 Ranking Scientific Papers Using Preference Learning
Link: https://arxiv.org/abs/2109.01190
Authors: Nils Dycke, Edwin Simpson, Ilia Kuznetsov, Iryna Gurevych
Affiliations: Ubiquitous Knowledge Processing Lab (UKP-TUDA), Department of Computer Science, Technische Universität Darmstadt; Intelligent Systems Lab, University of Bristol
Abstract: Peer review is the main quality control mechanism in academia. Quality of scientific work has many dimensions; coupled with the subjective nature of the reviewing task, this makes final decision making based on the reviews and scores therein very difficult and time-consuming. To assist with this important task, we cast it as a paper ranking problem based on peer review texts and reviewer scores. We introduce a novel, multi-faceted generic evaluation framework for making final decisions based on peer reviews that takes into account effectiveness, efficiency and fairness of the evaluated system. We propose a novel approach to paper ranking based on Gaussian Process Preference Learning (GPPL) and evaluate it on peer review data from the ACL-2018 conference. Our experiments demonstrate the superiority of our GPPL-based approach over prior work, while highlighting the importance of using both texts and review scores for paper ranking during peer review aggregation.
Others (5 papers)
【1】 A Longitudinal Multi-modal Dataset for Dementia Monitoring and Diagnosis
Link: https://arxiv.org/abs/2109.01537
Authors: Dimitris Gkoumas, Bo Wang, Adam Tsakalidis, Maria Wolters, Arkaitz Zubiaga, Matthew Purver, Maria Liakata
Affiliations: School of Electronic Engineering and Computer Science, Queen Mary University of London, UK; Department of Psychiatry, University of Oxford, UK; The Alan Turing Institute, London, UK; School of Informatics, University of Edinburgh, UK
Abstract: Dementia is a family of neurodegenerative conditions affecting memory and cognition in an increasing number of individuals in our globally aging population. Automated analysis of language, speech and paralinguistic indicators have been gaining popularity as potential indicators of cognitive decline. Here we propose a novel longitudinal multi-modal dataset collected from people with mild dementia and age-matched controls over a period of several months in a natural setting. The multi-modal data consists of spoken conversations, a subset of which are transcribed, as well as typed and written thoughts and associated extra-linguistic information such as pen strokes and keystrokes. We describe the dataset in detail and proceed to focus on a task using the speech modality. The latter involves distinguishing controls from people with dementia by exploiting the longitudinal nature of the data. Our experiments showed significant differences in how the speech varied from session to session in the control and dementia groups.
【2】 An Exploratory Study on Utilising the Web of Linked Data for Product Data Mining
Link: https://arxiv.org/abs/2109.01411
Authors: Ziqi Zhang, Xingyi Song
Note: Currently under review at the LRE journal
Abstract: The Linked Open Data practice has led to a significant growth of structured data on the Web in the last decade. Such structured data describe real-world entities in a machine-readable way, and have created an unprecedented opportunity for research in the field of Natural Language Processing. However, there is a lack of studies on how such data can be used, for what kind of tasks, and to what extent they can be useful for these tasks. This work focuses on the e-commerce domain to explore methods of utilising such structured data to create language resources that may be used for product classification and linking. We process billions of structured data points in the form of RDF n-quads, to create multi-million words of product-related corpora that are later used in three different ways for creating language resources: training word embedding models, continued pre-training of BERT-like language models, and training Machine Translation models that are used as a proxy to generate product-related keywords. Our evaluation on an extensive set of benchmarks shows word embeddings to be the most reliable and consistent method to improve the accuracy on both tasks (with up to 6.9 percentage points in macro-average F1 on some datasets). The other two methods however, are not as useful. Our analysis shows that this could be due to a number of reasons, including the biased domain representation in the structured data and lack of vocabulary coverage. We share our datasets and discuss how our lessons learned could be taken forward to inform future research in this direction.
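The winning route in the study, word embeddings trained on text extracted from structured data, is straightforward to sketch: pull literal values out of RDF n-quads and feed the corpus to word2vec. The n-quad lines and tokenization below are simplified for illustration.

```python
# Sketch: extract literals from product n-quads, train word vectors.
import re
from gensim.models import Word2Vec

nquads = [
    '<http://ex.org/p1> <http://schema.org/name> "Trail running shoe" <http://ex.org/g> .',
    '<http://ex.org/p1> <http://schema.org/description> "Lightweight shoe with grippy outsole" <http://ex.org/g> .',
]

corpus = []
for line in nquads:
    m = re.search(r'"([^"]+)"', line)    # grab the literal object value
    if m:
        corpus.append(m.group(1).lower().split())

model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1)
print(model.wv.most_similar("shoe", topn=2))
```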
【3】 Indexing Context-Sensitive Reachability
Link: https://arxiv.org/abs/2109.01321
Authors: Qingkai Shi, Yongchao Wang, Charles Zhang
Affiliations: The Hong Kong University of Science and Technology and Ant Group
Abstract: Many context-sensitive data flow analyses can be formulated as a variant of the all-pairs Dyck-CFL reachability problem, which, in general, is of sub-cubic time complexity and quadratic space complexity. Such high complexity significantly limits the scalability of context-sensitive data flow analysis and is not affordable for analyzing large-scale software. This paper presents Flare, a reduction from the CFL reachability problem to the conventional graph reachability problem for context-sensitive data flow analysis. This reduction allows us to benefit from recent advances in reachability indexing schemes, which often consume almost linear space for answering reachability queries in almost constant time. We have applied our reduction to a context-sensitive alias analysis and a context-sensitive information-flow analysis for C/C++ programs. Experimental results on standard benchmarks and open-source software demonstrate that we can achieve orders of magnitude speedup at the cost of only moderate space to store the indexes. The implementation of our approach is publicly available.
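The payoff of reducing CFL-reachability to plain graph reachability is that standard reachability indexes apply: pay an offline indexing cost once, then answer queries in near-constant time. The toy sketch below uses full transitive closure, the simplest such index; production schemes (e.g. 2-hop labelling) get the space down to near-linear.

```python
# Sketch of a reachability index on a plain digraph: precompute the set of
# nodes reachable from each node, then queries are a set-membership test.
from collections import defaultdict

edges = {("a", "b"), ("b", "c"), ("b", "d"), ("d", "e")}
succ = defaultdict(set)
for u, v in edges:
    succ[u].add(v)

def build_index(nodes):
    """One DFS per node: offline cost, constant-time queries afterwards."""
    index = {}
    for s in nodes:
        seen, stack = set(), [s]
        while stack:
            u = stack.pop()
            for v in succ[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        index[s] = seen
    return index

nodes = {x for e in edges for x in e}
index = build_index(nodes)
print("c" in index["a"], "a" in index["e"])  # True False
```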
【4】 So Cloze yet so Far: N400 Amplitude is Better Predicted by Distributional Information than Human Predictability Judgements
Link: https://arxiv.org/abs/2109.01226
Authors: James A. Michaelov, Seana Coulson, Benjamin K. Bergen
Affiliations: Department of Cognitive Science, University of California San Diego
Note: Submitted
Abstract: More predictable words are easier to process - they are read faster and elicit smaller neural signals associated with processing difficulty, most notably, the N400 component of the event-related brain potential. Thus, it has been argued that prediction of upcoming words is a key component of language comprehension, and that studying the amplitude of the N400 is a valuable way to investigate the predictions that we make. In this study, we investigate whether the linguistic predictions of computational language models or humans better reflect the way in which natural language stimuli modulate the amplitude of the N400. One important difference in the linguistic predictions of humans versus computational language models is that while language models base their predictions exclusively on the preceding linguistic context, humans may rely on other factors. We find that the predictions of three top-of-the-line contemporary language models - GPT-3, RoBERTa, and ALBERT - match the N400 more closely than human predictions. This suggests that the predictive processes underlying the N400 may be more sensitive to the surface-level statistics of language than previously thought.
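The language-model predictability measure at the heart of such comparisons is per-token surprisal, -log p(word | context). A sketch with GPT-2 as a small stand-in for the larger models used in the paper:

```python
# Sketch: per-token surprisal from an autoregressive LM. Lower surprisal
# ~ more predictable ~ smaller expected N400 amplitude.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

sentence = "I take my coffee with cream and sugar"
ids = tok(sentence, return_tensors="pt").input_ids

with torch.no_grad():
    logits = lm(ids).logits                  # (1, seq_len, vocab)
log_probs = F.log_softmax(logits, dim=-1)

# Surprisal of token t is -log p(token_t | tokens_<t).
for t in range(1, ids.size(1)):
    s = -log_probs[0, t - 1, ids[0, t]].item()
    print(tok.decode(ids[0, t]), round(s, 2))
```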
【5】 Quantifying Reproducibility in NLP and ML
Link: https://arxiv.org/abs/2109.01211
Authors: Anya Belz
Affiliations: ADAPT Centre, Dublin City University, Ireland
Abstract: Reproducibility has become an intensely debated topic in NLP and ML over recent years, but no commonly accepted way of assessing reproducibility, let alone quantifying it, has so far emerged. The assumption has been that wider scientific reproducibility terminology and definitions are not applicable to NLP/ML, with the result that many different terms and definitions have been proposed, some diametrically opposed. In this paper, we test this assumption, by taking the standard terminology and definitions from metrology and applying them directly to NLP/ML. We find that we are able to straightforwardly derive a practical framework for assessing reproducibility which has the desirable property of yielding a quantified degree of reproducibility that is comparable across different reproduction studies.
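One way metrology-style precision yields a quantified, comparable degree of reproducibility is the coefficient of variation across repeated measurements of the same score. The small illustration below follows that spirit; the paper's exact derived measure may differ in detail.

```python
# Illustration in the spirit of the paper: quantify reproducibility of a
# reported score as the coefficient of variation (with a small-sample
# correction) across repeated measurements. Scores below are made up.
import statistics

scores = [27.1, 26.8, 27.4, 26.9]          # e.g. BLEU from four reproductions
n = len(scores)
mean = statistics.mean(scores)
sd = statistics.stdev(scores)              # sample standard deviation
cv = 100 * (1 + 1 / (4 * n)) * sd / mean   # small-sample corrected CV, in %
print(f"mean={mean:.2f}, CV*={cv:.2f}%")   # smaller CV* = more reproducible
```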