Visit www.arxivdaily.com for daily digests with abstracts, covering CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, favorites, and posting features.
cs.CL: 39 papers today
Transformer (2 papers)
【1】 Ad Text Classification with Transformer-Based Natural Language Processing Methods
Authors: Umut Özdil, Büşra Arslan, D. Emre Taşar, Gökçe Polat, Şükrü Ozan
Affiliations: AdresGezgini A.Ş. Ar-Ge Merkezi, İzmir, Türkiye; Dokuz Eylül Üniversitesi Y.B.S Y.L. Öğr., İzmir, Türkiye
Comments: 6 pages, in Turkish, 4 figures, 3 tables; 25. Pazarlama Konferansı (25th Marketing Conference)
Link: https://arxiv.org/abs/2106.10899
Abstract: In this study, a natural language processing (NLP) based method is proposed for the sector-wise automatic classification of ad texts created on online advertising platforms. Our data set consists of approximately 21,000 labeled advertising texts from 12 different sectors. We use the Bidirectional Encoder Representations from Transformers (BERT) model, a transformer-based language model recently applied to tasks such as text classification in the NLP literature. The classification performance obtained with a pre-trained BERT model for the Turkish language is reported in detail.
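As a concrete illustration of the recipe this abstract describes, below is a minimal sketch of fine-tuning a pre-trained Turkish BERT for 12-way sector classification with the HuggingFace transformers library. The checkpoint name, hyperparameters, and toy examples are assumptions for illustration, not the authors' exact setup.

```python
# Minimal sketch: fine-tune a Turkish BERT for 12-way ad-sector classification.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "dbmdz/bert-base-turkish-cased"  # assumed checkpoint (BERTurk), not necessarily the authors'
NUM_SECTORS = 12

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=NUM_SECTORS)

texts = ["Uygun fiyatli arac kiralama kampanyasi", "Dis klinigimizde implant tedavisi"]  # toy ad texts
labels = torch.tensor([3, 7])  # toy sector ids in [0, 12)

batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
out = model(**batch, labels=labels)  # cross-entropy loss over the 12 sectors
out.loss.backward()
optimizer.step()
```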
【2】 Transformers for Headline Selection for Russian News Clusters
Authors: Pavel Voropaev, Olga Sopilnyak
Affiliations: Moscow Institute of Physics and Technology, Moscow, Russia
Comments: Accepted to the Dialogue 2021 conference
Link: https://arxiv.org/abs/2106.10487
Abstract: In this paper, we explore various multilingual and Russian pre-trained transformer-based models for the Dialogue Evaluation 2021 shared task on headline selection. Our experiments show that the combined approach is superior to individual multilingual and monolingual models. We present an analysis of a number of ways to obtain sentence embeddings and learn a ranking model on top of them. We achieve 87.28% and 86.60% accuracy on the public and private test sets respectively.
QA | VQA | Question Answering | Dialogue (1 paper)
【1】 Learning to Rank Question Answer Pairs with Bilateral Contrastive Data Augmentation
Authors: Yang Deng, Wenxuan Zhang, Wai Lam
Affiliations: The Chinese University of Hong Kong
Link: https://arxiv.org/abs/2106.11096
Abstract: In this work, we propose a novel and easy-to-apply data augmentation strategy, namely Bilateral Generation (BiG), with a contrastive training objective for improving the performance of ranking question answer pairs with existing labeled data. Specifically, we synthesize pseudo-positive QA pairs, in contrast to the original negative QA pairs, with two pre-trained generation models, one for question generation and the other for answer generation, which are fine-tuned on the limited positive QA pairs from the original dataset. With the augmented dataset, we design a contrastive training objective for learning to rank question answer pairs. Experimental results on three benchmark datasets, namely TREC-QA, WikiQA, and ANTIQUE, show that our method significantly improves the performance of ranking models by making full use of existing labeled data, and can be easily applied to different ranking models.
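The abstract does not spell out the exact form of the contrastive objective, so the following is a generic InfoNCE-style sketch of scoring a question against one (pseudo-)positive answer and several negatives; the embedding dimensionality and temperature are illustrative assumptions.

```python
# Generic contrastive ranking loss over one positive and k negative answers.
import torch
import torch.nn.functional as F

def contrastive_ranking_loss(q_vec, pos_vec, neg_vecs, temperature=0.1):
    """q_vec: (d,), pos_vec: (d,), neg_vecs: (k, d) sentence embeddings."""
    cands = torch.cat([pos_vec.unsqueeze(0), neg_vecs], dim=0)       # (1+k, d)
    scores = F.cosine_similarity(q_vec.unsqueeze(0), cands, dim=-1)  # (1+k,)
    logits = (scores / temperature).unsqueeze(0)                     # (1, 1+k)
    # the positive sits at index 0, so it is the "correct class"
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))

q, pos, negs = torch.randn(768), torch.randn(768), torch.randn(5, 768)
loss = contrastive_ranking_loss(q, pos, negs)
```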
Machine Translation (1 paper)
【1】 Challenges in Translation of Emotions in Multilingual User-Generated Content: Twitter as a Case Study
Authors: Hadeel Saadany, Constantin Orasan, Rocio Caro Quintana, Felix do Carmo, Leonardo Zilio
Affiliations: Centre for Translation Studies, University of Surrey; RGCL, University of Wolverhampton
Link: https://arxiv.org/abs/2106.10719
Abstract: Although emotions are universal concepts, transferring the different shades of emotion from one language to another may not always be straightforward for human translators, let alone for machine translation systems. Moreover, cognitive states are established by verbal explanations of experience, which are shaped by both verbal and cultural contexts. There are a number of verbal contexts where the expression of emotions constitutes the pivotal component of the message. This is particularly true for User-Generated Content (UGC), which can take the form of a review of a product or a service, a tweet, or a social media post. Recently, it has become common practice for multilingual websites such as Twitter to provide an automatic translation of UGC to reach out to their linguistically diverse users. In such scenarios, the process of translating the user's emotion is entirely automatic, with no human intervention for either post-editing or accuracy checking. In this research, we assess whether automatic translation tools can be a successful real-life utility in transferring emotion in user-generated multilingual data such as tweets. We show that there are linguistic phenomena specific to Twitter data that pose a challenge to the translation of emotions across languages. We summarise these challenges in a list of linguistic features and show how frequent these features are in different language pairs. We also assess the capacity of commonly used methods for evaluating the performance of an MT system with respect to the preservation of emotion in the source text.
Semantic Analysis (2 papers)
【1】 STEP-EZ: Syntax Tree guided semantic ExPlanation for Explainable Zero-shot modeling of clinical depression symptoms from text
Authors: Nawshad Farruque, Randy Goebel, Osmar Zaiane, Sudhakar Sivapalan
Affiliations: Department of Computing Science, University of Alberta; Alberta Machine Intelligence Institute (AMII), University of Alberta; Department of Psychiatry, University of Alberta
Link: https://arxiv.org/abs/2106.10928
Abstract: We focus on exploring various approaches to Zero-Shot Learning (ZSL), and their explainability, for a challenging yet important supervised learning task notorious for training data scarcity: Depression Symptoms Detection (DSD) from text. We start with a comprehensive synthesis of the different components of our ZSL modeling, and an analysis of our ground-truth samples and of the depression-symptom clue curation process, carried out with the help of a practicing clinician. We next analyze the accuracy of various state-of-the-art ZSL models and their potential enhancements for our task. Further, we sketch a framework for the use of ZSL in a hierarchical text-based explanation mechanism, which we call Syntax Tree-Guided Semantic Explanation (STEP). Finally, we summarize experiments from which we conclude that we can use ZSL models to achieve reasonable accuracy and explainability, measured by a proposed Explainability Index (EI). This work is, to our knowledge, the first to exhaustively explore the efficacy of ZSL models for the DSD task, in terms of both accuracy and explainability.
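For readers unfamiliar with ZSL in this setting, one common recipe (a plausible building block for systems like this, not necessarily the authors') is NLI-based zero-shot classification, where candidate symptom labels are scored as entailment hypotheses. The symptom names and checkpoint below are illustrative assumptions, not the paper's curated clue set.

```python
# Zero-shot symptom scoring via an NLI model and the HF zero-shot pipeline.
from transformers import pipeline

clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
symptoms = ["sleep disturbance", "loss of interest", "fatigue", "low mood"]

post = "I can't remember the last time I slept through the night."
result = clf(post, candidate_labels=symptoms, multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")
```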
【2】 A Brief Study on the Effects of Training Generative Dialogue Models with a Semantic Loss
Authors: Prasanna Parthasarathi, Mohamed Abdelsalam, Joelle Pineau, Sarath Chandar
Affiliations: School of Computer Science, McGill University; University of Montréal; École Polytechnique de Montréal; Quebec Artificial Intelligence Institute (Mila); Canada CIFAR AI Chair
Comments: Accepted at SIGDial 2021
Link: https://arxiv.org/abs/2106.10619
Abstract: Neural models trained for next-utterance generation in dialogue tasks learn to mimic the n-gram sequences in the training set with training objectives like negative log-likelihood (NLL) or cross-entropy. Such commonly used training objectives do not foster generating alternate responses to a context. Moreover, the effect of minimizing an alternate training objective that encourages a model to generate alternate responses and score them on semantic similarity has not been well studied. We hypothesize that a language generation model can improve its diversity by learning to generate alternate text during training and minimizing a semantic loss as an auxiliary objective. We explore this idea on two data sets of different sizes on the task of next-utterance generation in goal-oriented dialogues. We make two observations: (1) minimizing a semantic objective improved response diversity on the smaller data set (Frames), but was only as good as minimizing the NLL on the larger data set (MultiWoZ); (2) large language model embeddings can be more useful as a semantic loss objective than as initialization for token embeddings.
Graph | Knowledge Graph | Knowledge (5 papers)
【1】 Toward Knowledge Discovery Framework for Data Science Job Market in the United States
Authors: Mojtaba Heidarysafa, Kamran Kowsari, Masoud Bashiri, Donald E. Brown
Affiliations: Department of Systems and Information Engineering, University of Virginia; Office of Health Informatics and Analytics, University of California, Los Angeles; School of Data Science, University of Virginia
Link: https://arxiv.org/abs/2106.11077
Abstract: The growth of the data science field requires better tools to understand such a fast-paced, growing domain. Moreover, individuals from different backgrounds have become interested in pursuing a career as data scientists. Therefore, providing a quantitative guide for individuals and organizations to understand the skills required in the job market would be crucial. This paper introduces a framework to analyze the job market for data science-related jobs within the US, while providing an interface to access insights into this market. The proposed framework includes three sub-modules allowing continuous data collection, information extraction, and a web-based dashboard visualization to investigate the spatial and temporal distribution of data science-related jobs and skills. The results of this work show important skills for the main branches of data science jobs and attempt to provide a skill-based definition of these data science branches. The current version of this application is deployed on the web and allows individuals and institutes to investigate the skills required for data science positions through the industry lens.
【2】 Extractive approach for text summarisation using graphs
Authors: Kastriot Kadriu, Milenko Obradovic
Affiliations: University of Ljubljana, Večna pot, Ljubljana, Slovenia
Comments: 4 pages, 2 figures, 5 tables
Link: https://arxiv.org/abs/2106.10955
Abstract: Natural language processing is an important discipline with the aim of understanding text through its digital representation, which, due to the diverse ways we write and speak, is often not accurate enough. Our paper explores different graph-related algorithms that can be used to solve the text summarization problem with an extractive approach. We consider two metrics for measuring sentence similarity: sentence overlap and edit distance.
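A minimal sketch of such a pipeline, under assumed details: build a sentence-similarity graph with the two metrics the paper mentions, rank sentences with PageRank, and return the top-ranked ones.

```python
# Graph-based extractive summarization: similarity graph + PageRank.
import networkx as nx

def overlap_sim(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, min(len(ta), len(tb)))

def edit_sim(a, b):
    # normalized Levenshtein similarity, computed with a one-row DP table
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (a[i - 1] != b[j - 1]))
    return 1 - d[n] / max(1, max(m, n))

def summarize(sentences, k=2, sim=overlap_sim):
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            w = sim(sentences[i], sentences[j])
            if w > 0:
                g.add_edge(i, j, weight=w)
    ranks = nx.pagerank(g, weight="weight")
    top = sorted(ranks, key=ranks.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # keep original order
```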
【3】 ROPE: Reading Order Equivariant Positional Encoding for Graph-based Document Information Extraction
Authors: Chen-Yu Lee, Chun-Liang Li, Chu Wang, Renshen Wang, Yasuhisa Fujii, Siyang Qin, Ashok Popat, Tomas Pfister
Affiliations: Google Cloud AI; McGill University; Google Research
Comments: Accepted to ACL-IJCNLP 2021 (Oral)
Link: https://arxiv.org/abs/2106.10786
Abstract: Natural reading orders of words are crucial for information extraction from form-like documents. Despite recent advances in Graph Convolutional Networks (GCNs) in modeling the spatial layout patterns of documents, they have limited ability to capture the reading orders of given word-level node representations in a graph. We propose Reading Order Equivariant Positional Encoding (ROPE), a new positional encoding technique designed to apprehend the sequential presentation of words in documents. ROPE generates unique reading order codes for neighboring words relative to the target word, given a word-level graph connectivity. We study two fundamental document entity extraction tasks, word labeling and word grouping, on the public FUNSD dataset and a large-scale payment dataset. We show that ROPE consistently improves existing GCNs by a margin of up to 8.4% F1-score.
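The abstract only sketches how the codes are formed; the toy rendering below, under our own assumptions, encodes each neighbor by its reading-order offset relative to the target word. How ROPE actually parameterizes these codes may differ.

```python
# Toy "reading order codes": per target word, the signed reading-order
# offset of each of its graph neighbors.
def reading_order_codes(reading_order, neighbors):
    """reading_order: word -> position; neighbors: word -> list of words."""
    codes = {}
    for w, ns in neighbors.items():
        codes[w] = {n: reading_order[n] - reading_order[w] for n in ns}
    return codes

order = {"Total": 0, "Amount": 1, "Due:": 2, "$42.00": 3}
graph = {"Total": ["Amount", "$42.00"], "$42.00": ["Due:", "Total"]}
print(reading_order_codes(order, graph))
# {'Total': {'Amount': 1, '$42.00': 3}, '$42.00': {'Due:': -1, 'Total': -3}}
```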
【4】 JointGT: Graph-Text Joint Representation Learning for Text Generation from Knowledge Graphs
Authors: Pei Ke, Haozhe Ji, Yu Ran, Xin Cui, Liwei Wang, Linfeng Song, Xiaoyan Zhu, Minlie Huang
Affiliations: The CoAI group, Department of Computer Science and Technology, Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems, Beijing National Research Center for Information Science and Technology
Comments: ACL 2021 (Findings)
Link: https://arxiv.org/abs/2106.10502
Abstract: Existing pre-trained models for knowledge-graph-to-text (KG-to-text) generation simply fine-tune text-to-text pre-trained models such as BART or T5 on KG-to-text datasets, which largely ignores the graph structure during encoding and lacks elaborate pre-training tasks to explicitly model graph-text alignments. To tackle these problems, we propose a graph-text joint representation learning model called JointGT. During encoding, we devise a structure-aware semantic aggregation module which is plugged into each Transformer layer to preserve the graph structure. Furthermore, we propose three new pre-training tasks to explicitly enhance graph-text alignment, including respective text / graph reconstruction, and graph-text alignment in the embedding space via Optimal Transport. Experiments show that JointGT obtains new state-of-the-art performance on various KG-to-text datasets.
【5】 Enhancing Question Generation with Commonsense Knowledge
Authors: Xin Jia, Hao Wang, Dawei Yin, Yunfang Wu
Affiliations: MOE Key Lab of Computational Linguistics, School of EECS, Peking University; Baidu Inc., China
Comments: Accepted by CCL 2021
Link: https://arxiv.org/abs/2106.10454
Abstract: Question generation (QG) is the task of generating natural and grammatical questions that can be answered by a specific answer for a given context. Previous sequence-to-sequence models suffer from the problem that asking high-quality questions requires commonsense knowledge as background, which in most cases cannot be learned directly from training data, resulting in unsatisfactory questions deprived of knowledge. In this paper, we propose a multi-task learning framework to introduce commonsense knowledge into the question generation process. We first retrieve relevant commonsense knowledge triples from mature databases and select triples with the conversion information from source context to question. Based on these informative knowledge triples, we design two auxiliary tasks to incorporate commonsense knowledge into the main QG model, where one task is Concept Relation Classification and the other is Tail Concept Generation. Experimental results on SQuAD show that our proposed methods noticeably improve QG performance on both automatic and human evaluation metrics, demonstrating that incorporating external commonsense knowledge with multi-task learning can help the model generate human-like, high-quality questions.
Summarization | Information Extraction (1 paper)
【1】 A Condense-then-Select Strategy for Text Summarization
Authors: Hou Pong Chan, Irwin King
Affiliations: Department of Computer and Information Science, University of Macau, Macau SAR; Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
Comments: Accepted by the Knowledge-Based Systems (KBS) journal
Link: https://arxiv.org/abs/2106.10468
Abstract: Select-then-compress is a popular hybrid framework for text summarization due to its high efficiency. This framework first selects salient sentences and then independently condenses each of the selected sentences into a concise version. However, compressing sentences separately ignores the context information of the document, and is therefore prone to deleting salient information. To address this limitation, we propose a novel condense-then-select framework for text summarization. Our framework first concurrently condenses each document sentence. The original document sentences and their compressed versions then become the candidates for extraction. Finally, an extractor utilizes the context information of the document to select candidates and assemble them into a summary. If salient information is deleted during condensing, the extractor can select the original sentence to retain the information. Thus, our framework helps to avoid the loss of salient information, while preserving the high efficiency of sentence-level compression. Experimental results on the CNN/DailyMail, DUC-2002, and PubMed datasets demonstrate that our framework outperforms the select-then-compress framework and other strong baselines.
Reasoning | Analysis | Understanding | Explanation (2 papers)
【1】 Understanding the Dynamics between Vaping and Cannabis Legalization Using Twitter Opinions
Authors: Shishir Adhikari, Akshay Uppal, Robin Mermelstein, Tanya Berger-Wolf, Elena Zheleva
Affiliations: Computer Science, University of Illinois at Chicago; Psychology and Institute for Health Research and Policy, University of Illinois at Chicago; Computer Science and Engineering; Electrical and Computer Engineering; Evolution, Ecology, and Organismal Biology
Comments: Published at ICWSM 2021
Link: https://arxiv.org/abs/2106.11029
Abstract: Cannabis legalization has been welcomed by many U.S. states, but its role in escalation from tobacco e-cigarette use to cannabis vaping is unclear. Meanwhile, cannabis vaping has been associated with new lung diseases and rising adolescent use. To understand the impact of cannabis legalization on escalation, we design an observational study to estimate the causal effect of recreational cannabis legalization on the development of pro-cannabis attitudes among e-cigarette users. We collect and analyze Twitter data containing opinions about cannabis and JUUL, a very popular e-cigarette brand. We use weakly supervised learning for personal tweet filtering and classification for stance detection. We discover that recreational cannabis legalization policy has an effect on the increased development of pro-cannabis attitudes among users already in favor of e-cigarettes.
【2】 Out of Context: A New Clue for Context Modeling of Aspect-based Sentiment Analysis
Authors: Bowen Xing, Ivor W. Tsang
Affiliations: Australian Artificial Intelligence Institute, University of Technology Sydney, Ultimo, NSW, Australia
Comments: Submitted to JAIR
Link: https://arxiv.org/abs/2106.10816
Abstract: Aspect-based sentiment analysis (ABSA) aims to predict the sentiment expressed in a review with respect to a given aspect. The core of ABSA is to model the interaction between the context and the given aspect to extract aspect-related information. In prior work, attention mechanisms and dependency graph networks are commonly adopted to capture the relations between the context and the given aspect, and the weighted sum of context hidden states is used as the final representation fed to the classifier. However, information related to the given aspect may already have been discarded, and adverse information may be retained, in the context modeling processes of existing models. This problem cannot be solved by subsequent modules, for two reasons: first, their operations are conducted on the encoder-generated context hidden states, whose values cannot change after the encoder; second, existing encoders only consider the context and not the given aspect. To address this problem, we argue that the given aspect should be considered as a new clue out of context in the context modeling process. As for solutions, we design several aspect-aware context encoders based on different backbones: an aspect-aware LSTM and three aspect-aware BERTs. They are dedicated to generating aspect-aware hidden states tailored for the ABSA task. In these aspect-aware context encoders, the semantics of the given aspect is used to regulate the information flow. Consequently, aspect-related information can be retained and aspect-irrelevant information excluded in the generated hidden states. We conduct extensive experiments on several benchmark datasets with empirical analysis, demonstrating the efficacy and advantages of our proposed aspect-aware context encoders.
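As a rough illustration of the "regulate the information flow" idea, an aspect vector can gate each hidden state inside the encoding loop, rather than re-weighting states afterwards. The gating form below is our own assumption, not the paper's exact aspect-aware LSTM equations.

```python
# Sketch: aspect vector gates hidden states during (not after) encoding.
import torch
import torch.nn as nn

class AspectGatedEncoder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.cell = nn.LSTMCell(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, token_embs, aspect_vec):   # (B, T, D), (B, D)
        h = c = token_embs.new_zeros(token_embs.size(0), token_embs.size(2))
        states = []
        for t in range(token_embs.size(1)):
            h, c = self.cell(token_embs[:, t], (h, c))
            g = torch.sigmoid(self.gate(torch.cat([h, aspect_vec], dim=-1)))
            h = g * h  # keep aspect-relevant information, suppress the rest
            states.append(h)
        return torch.stack(states, dim=1)         # (B, T, D)

enc = AspectGatedEncoder(64)
out = enc(torch.randn(2, 10, 64), torch.randn(2, 64))
```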
GAN | Adversarial | Attack | Generation (1 paper)
【1】 Empower Distantly Supervised Relation Extraction with Collaborative Adversarial Training
Authors: Tao Chen, Haochen Shi, Liyuan Liu, Siliang Tang, Jian Shao, Zhigang Chen, Yueting Zhuang
Affiliations: Zhejiang University; University of Illinois at Urbana-Champaign; iFLYTEK Research
Comments: Accepted by AAAI 2021
Link: https://arxiv.org/abs/2106.10835
Abstract: With recent advances in distantly supervised (DS) relation extraction (RE), considerable attention has been attracted to leveraging multi-instance learning (MIL) to distill high-quality supervision from the noisy DS. Here, we go beyond label noise and identify the key bottleneck of DS-MIL as its low data utilization: as high-quality supervision is refined by MIL, MIL abandons a large number of training instances, which leads to low data utilization and hinders model training from having abundant supervision. In this paper, we propose collaborative adversarial training to improve data utilization, coordinating virtual adversarial training (VAT) and adversarial training (AT) at different levels. Specifically, since VAT is label-free, we employ instance-level VAT to recycle instances abandoned by MIL. Besides, we deploy AT at the bag level to unleash the full potential of the high-quality supervision obtained by MIL. Our proposed method brings consistent improvements (~5 points of absolute AUC) over the previous state of the art, which verifies the importance of the data utilization issue and the effectiveness of our method.
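A minimal sketch of the label-free VAT signal that lets abandoned instances be recycled: perturb the input embeddings in the direction that most changes the model's prediction, then penalize that change. It assumes a `model` mapping embeddings to logits, and the hyperparameters follow common VAT defaults rather than the paper's settings.

```python
# One-power-iteration virtual adversarial training (VAT) loss.
import torch
import torch.nn.functional as F

def vat_loss(model, embs, xi=1e-6, eps=1.0):
    """embs: (batch, seq, dim) input embeddings; model(embs) returns logits."""
    with torch.no_grad():
        p = F.softmax(model(embs), dim=-1)        # current prediction
    d = torch.randn_like(embs)                    # random direction
    d = xi * d / (d.norm(dim=-1, keepdim=True) + 1e-12)
    d.requires_grad_()
    kl = F.kl_div(F.log_softmax(model(embs + d), dim=-1), p,
                  reduction="batchmean")
    (grad,) = torch.autograd.grad(kl, d)          # most sensitive direction
    r_adv = eps * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    return F.kl_div(F.log_softmax(model(embs + r_adv), dim=-1), p,
                    reduction="batchmean")        # no labels needed
```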
Semi/Weakly/Un-supervised | Uncertainty (3 papers)
【1】 Iterative Network Pruning with Uncertainty Regularization for Lifelong Sentiment Classification
Authors: Binzong Geng, Min Yang, Fajie Yuan, Shupeng Wang, Xiang Ao, Ruifeng Xu
Affiliations: University of Science and Technology of China; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Westlake University; Key Lab of Intelligent Information Processing of Chinese Academy of Sciences, Institute of Computing Technology, CAS
Comments: Accepted by the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2021
Link: https://arxiv.org/abs/2106.11197
Abstract: Lifelong learning capabilities are crucial for sentiment classifiers to process continuous streams of opinionated information on the Web. However, performing lifelong learning is non-trivial for deep neural networks, as continually training on incrementally available information inevitably results in catastrophic forgetting or interference. In this paper, we propose a novel iterative network pruning with uncertainty regularization method for lifelong sentiment classification (IPRLS), which leverages the principles of network pruning and weight regularization. By performing network pruning with uncertainty regularization in an iterative manner, IPRLS can adapt a single BERT model to work with continuously arriving data from multiple domains while avoiding catastrophic forgetting and interference. Specifically, we leverage an iterative pruning method to remove redundant parameters in large deep networks so that the freed-up space can then be employed to learn new tasks, tackling the catastrophic forgetting problem. Instead of keeping the old tasks fixed when learning new tasks, we also use an uncertainty regularization based on the Bayesian online learning framework to constrain the updates of old-task weights in BERT, which enables positive backward transfer, i.e., learning new tasks improves performance on past tasks while protecting old knowledge from being lost. In addition, we propose a task-specific low-dimensional residual function in parallel to each layer of BERT, which makes IPRLS less prone to losing the knowledge saved in the base BERT network when learning a new task. Extensive experiments on 16 popular review corpora demonstrate that the proposed IPRLS method significantly outperforms strong baselines for lifelong sentiment classification. For reproducibility, we submit the code and data at https://github.com/siat-nlp/IPRLS.
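A minimal sketch of the iterative-pruning half of this recipe, using PyTorch's built-in pruning utilities; the toy model, layer choice, and pruning ratio are illustrative assumptions, not the paper's configuration.

```python
# Iterative magnitude pruning: repeatedly remove the smallest-magnitude
# weights, retraining in between, so freed capacity can host a new task.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))

for round_idx in range(3):                 # several prune-retrain rounds
    for module in model:
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.2)
    # ... fine-tune on the current task here before the next round ...

for module in model:                       # make the pruning permanent
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```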
【2】 ArgFuse: A Weakly-Supervised Framework for Document-Level Event Argument Aggregation
Authors: Debanjana Kar, Sudeshna Sarkar, Pawan Goyal
Affiliations: Department of Computer Science & Engineering, Indian Institute of Technology, Kharagpur
Comments: 11 pages, 8 figures; accepted at Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE) @ ACL-IJCNLP 2021
Link: https://arxiv.org/abs/2106.10862
Abstract: Most existing information extraction frameworks (Wadden et al., 2019; Veyseh et al., 2020) focus on sentence-level tasks and are hardly able to capture the consolidated information from a given document. In our endeavour to generate precise document-level information frames from lengthy textual records, we introduce the task of Information Aggregation, or Argument Aggregation. More specifically, our aim is to filter irrelevant and redundant argument mentions that were extracted at the sentence level and render a document-level information frame. The majority of existing works have been observed to resolve the related tasks of document-level event argument extraction (Yang et al., 2018a; Zheng et al., 2019a) and salient entity identification (Jain et al., 2020) using supervised techniques. To remove the dependency on large amounts of labelled data, we explore the task of information aggregation using weakly supervised techniques. In particular, we present an extractive algorithm with multiple sieves which adopts active learning strategies to work efficiently in low-resource settings. For this task, we have annotated our own test dataset comprising 131 document information frames, and have released the code and dataset to further research prospects in this new domain. To the best of our knowledge, we are the first to establish baseline results for this task in English. Our data and code are publicly available at https://github.com/DebanjanaKar/ArgFuse.
【3】 CIL: Contrastive Instance Learning Framework for Distantly Supervised Relation Extraction
Authors: Tao Chen, Haizhou Shi, Siliang Tang, Zhigang Chen, Fei Wu, Yueting Zhuang
Affiliations: Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies; State Key Laboratory of Cognitive Intelligence, Hefei, China
Comments: Accepted by ACL 2021
Link: https://arxiv.org/abs/2106.10855
Abstract: The journey of reducing noise from distant supervision (DS) generated training data started when DS was first introduced into the relation extraction (RE) task. For the past decade, researchers have applied the multi-instance learning (MIL) framework to find the most reliable feature from a bag of sentences. Although the MIL-bag paradigm can greatly reduce DS noise, it fails to represent many other useful sentence features in the datasets. In many cases, these sentence features can only be acquired with extra sentence-level human annotation at heavy cost. Therefore, the performance of distantly supervised RE models is bounded. In this paper, we go beyond the typical MIL framework and propose a novel contrastive instance learning (CIL) framework. Specifically, we regard the initial MIL as the relational triple encoder and constrain positive pairs against negative pairs for each instance. Experiments demonstrate the effectiveness of our proposed framework, with significant improvements over previous methods on NYT10, GDS, and KBP.
Detection (1 paper)
【1】 Hybrid approach to detecting symptoms of depression in social media entries
Authors: Agnieszka Wołk, Karol Chlasta, Paweł Holas
Affiliations: Polish-Japanese Academy of Information Technology, Warsaw; The Institute of Literary Research of the Polish Academy of Sciences, Warsaw; Kozminski University, Warsaw; University of Warsaw
Comments: 11 pages, 4 figures, 2 tables; The Pacific Asia Conference on Information Systems (PACIS 2021)
Link: https://arxiv.org/abs/2106.10485
Abstract: Sentiment and lexical analyses are widely used to detect depression or anxiety disorders. It has been documented that there are significant differences in the language used by a person with emotional disorders in comparison to a healthy individual. Still, the effectiveness of these lexical approaches could be improved further, because the current analysis focuses on what the social media entries are about, not on how they are written. In this study, we focus on the aspects in which these short texts are similar to each other, and on how they were created. We present an innovative approach to the depression screening problem by applying Collgram analysis, a known and effective method of obtaining linguistic information from texts. We compare these results with a sentiment analysis based on the BERT architecture. Finally, we create a hybrid model achieving a diagnostic accuracy of 71%.
Recognition / Classification (2 papers)
【1】 Does Robustness Improve Fairness? Approaching Fairness with Word Substitution Robustness Methods for Text Classification
Authors: Yada Pruksachatkun, Satyapriya Krishna, Jwala Dhamala, Rahul Gupta, Kai-Wei Chang
Affiliations: UCLA; Amazon Alexa
Link: https://arxiv.org/abs/2106.10826
Abstract: Existing bias mitigation methods for reducing disparities in model outcomes across cohorts have focused on data augmentation, debiasing model embeddings, or adding fairness-based optimization objectives during training. Separately, certified word substitution robustness methods have been developed to decrease the impact of spurious features and synonym substitutions on model predictions. While their end goals are different, they both aim to encourage models to make the same prediction for certain changes in the input. In this paper, we investigate the utility of certified word substitution robustness methods for improving equality of odds and equality of opportunity on multiple text classification tasks. We observe that certified robustness methods improve fairness, and that using both robustness and bias mitigation methods in training yields an improvement on both fronts.
【2】 Improving Compositional Generalization in Classification Tasks via Structure Annotations
Authors: Juyong Kim, Pradeep Ravikumar, Joshua Ainslie, Santiago Ontañón
Affiliations: Carnegie Mellon University; Google Research
Comments: Accepted as a short paper at ACL 2021
Link: https://arxiv.org/abs/2106.10434
Abstract: Compositional generalization is the ability to generalize systematically to a new data distribution by combining known components. Although humans seem to have a great ability to generalize compositionally, state-of-the-art neural models struggle to do so. In this work, we study compositional generalization in classification tasks and present two main contributions. First, we study ways to convert a natural language sequence-to-sequence dataset into a classification dataset that also requires compositional generalization. Second, we show that providing structural hints (specifically, providing parse trees and entity links as attention masks for a Transformer model) helps compositional generalization.
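A small sketch of the structural hint described above: turning a toy parse tree into a boolean attention mask so each token may only attend to itself, its parent, and its children. The exact mask construction used in the paper may differ.

```python
# Build a parse-tree attention mask for a Transformer layer.
import torch

tokens = ["jump", "twice", "after", "walk"]
parent = {1: 0, 2: 0, 3: 2}  # child index -> parent index; token 0 is the root

n = len(tokens)
mask = torch.eye(n, dtype=torch.bool)       # every token attends to itself
for child, par in parent.items():
    mask[child, par] = mask[par, child] = True

print(mask.int())
# Disallowed pairs would then be masked out before the softmax, e.g.:
#   scores = scores.masked_fill(~mask, float("-inf"))
```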
Representation (1 paper)
【1】 Do Encoder Representations of Generative Dialogue Models Encode Sufficient Information about the Task?
Authors: Prasanna Parthasarathi, Joelle Pineau, Sarath Chandar
Affiliations: School of Computer Science, McGill University; Quebec Artificial Intelligence Institute (Mila); École Polytechnique de Montréal; Canada CIFAR AI Chair
Comments: Accepted at SIGDial 2021. arXiv admin note: substantial text overlap with arXiv:2008.10427
Link: https://arxiv.org/abs/2106.10622
Abstract: Predicting the next utterance in dialogue is contingent on the encoding of the user's input text in order to generate an appropriate and relevant response in data-driven approaches. Although the semantic and syntactic quality of the generated language is evaluated, more often than not the encoded representation of the input is not. As the encoder's representation is essential for predicting an appropriate response, evaluating encoder representations is a challenging yet important problem. In this work, we show that evaluating the generated text through human or automatic metrics is not sufficient to appropriately assess the soundness of a dialogue model's language understanding, and, to that end, propose a set of probe tasks to evaluate the encoder representations of different language encoders commonly used in dialogue models. In experiments, we observe that some of the probe tasks are easier and some are harder to learn, even for sophisticated model architectures. Through experiments we also observe that RNN-based architectures score lower on automatic text-generation metrics than a transformer model, but perform better than the transformer on the probe tasks, indicating that RNNs might preserve task information better than Transformers.
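A minimal sketch of what a probe task looks like in practice: freeze the encoder, embed each utterance, and train a light classifier to predict a task attribute from the representation alone. The stand-in encoder and probe labels below are illustrative assumptions, not the paper's probes.

```python
# Probing a frozen encoder with a logistic-regression classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def encode(utterances):
    """Stand-in for a frozen encoder (RNN or Transformer) -> (n, d) array."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(utterances), 128))

utts = ["book a table for two"] * 50 + ["what time is my flight"] * 50
labels = [0] * 50 + [1] * 50                    # e.g. a dialogue-act probe

X = encode(utts)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
# High probe accuracy suggests the attribute is linearly recoverable
# from the encoder representation; chance-level suggests it is not.
```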
Word2Vec | Text | Words (1 paper)
【1】 Multi-Pair Text Style Transfer on Unbalanced Data
Authors: Xing Han, Jessica Lundin
Affiliations: University of Texas at Austin, Austin, TX; Salesforce, San Francisco, CA
Comments: Meta Learning and Its Applications to Natural Language Processing, ACL 2021 Workshop
Link: https://arxiv.org/abs/2106.10608
Abstract: Text-style transfer aims to convert text in one domain into another, by paraphrasing the sentence or substituting keywords, without altering the content. By necessity, state-of-the-art methods have evolved to accommodate nonparallel training data, as it is frequently the case that there are multiple data sources of unequal size, with a mixture of labeled and unlabeled sentences. Moreover, the inherent style defined within each source might be distinct. A generic bidirectional (e.g., formal $\Leftrightarrow$ informal) style transfer that disregards the different groups may not generalize well to different applications. In this work, we developed a task-adaptive meta-learning framework that can simultaneously perform multi-pair text-style transfer using a single model. The proposed method can adaptively balance the differences in meta-knowledge across multiple tasks. Results show that our method leads to better quantitative performance as well as coherent style variations. The common challenges of unbalanced data and mismatched domains are handled well by this method.
Other Neural Networks | Deep Learning | Models | Modeling (9 papers)
【1】 A Discriminative Entity-Aware Language Model for Virtual Assistants
Authors: Mandana Saebi, Ernest Pusateri, Aaksha Meghawat, Christophe Van Gysel
Affiliations: University of Notre Dame, Notre Dame, IN, USA; Apple, Cupertino, CA, USA
Comments: To appear in Interspeech 2021
Link: https://arxiv.org/abs/2106.11292
Abstract: High-quality automatic speech recognition (ASR) is essential for virtual assistants (VAs) to work well. However, ASR often performs poorly on VA requests containing named entities. In this work, we start from the observation that many ASR errors on named entities are inconsistent with real-world knowledge. We extend previous discriminative n-gram language modeling approaches to incorporate real-world knowledge from a Knowledge Graph (KG), using features that capture entity type-entity and entity-entity relationships. We apply our model through an efficient lattice rescoring process, achieving relative sentence error rate reductions of more than 25% on some synthesized test sets covering less popular entities, with minimal degradation on a uniformly sampled VA test set.
【2】 Self-Calibrating Neural-Probabilistic Model for Authorship Verification Under Covariate Shift
Authors: Benedikt Boenninghoff, Dorothea Kolossa, Robert M. Nickel
Affiliations: Ruhr University Bochum; Bucknell University, USA
Comments: 12th International Conference of the CLEF Association, 2021
Link: https://arxiv.org/abs/2106.11196
Abstract: We are addressing two fundamental problems in authorship verification (AV): topic variability and miscalibration. Variations in the topic of two disputed texts are a major cause of error for most AV systems. In addition, it can be observed that the underlying probability estimates produced by deep-learning AV mechanisms oftentimes do not match the actual case counts in the respective training data; as such, the probability estimates are poorly calibrated. We are expanding our framework from PAN 2020 to include Bayes factor scoring (BFS) and an uncertainty adaptation layer (UAL) to address both problems. Experiments with the 2020/21 PAN AV shared task data show that the proposed method significantly reduces sensitivity to topical variations and significantly improves the calibration of the system.
【3】 Explicit Interaction Network for Aspect Sentiment Triplet Extraction
Authors: Peiyi Wang, Lianzhe Huang, Tianyu Liu, Damai Dai, Runxin Xu, Houfeng Wang, Baobao Chang, Zhifang Sui
Affiliations: MOE Key Lab of Computational Linguistics, Peking University, Beijing, China
Link: https://arxiv.org/abs/2106.11148
Abstract: Aspect Sentiment Triplet Extraction (ASTE) aims to recognize targets, their sentiment polarities, and the opinions explaining the sentiment from a sentence. ASTE can naturally be divided into three atomic subtasks, namely target detection, opinion detection, and sentiment classification. We argue that a proper subtask combination, compositional feature extraction for target-opinion pairs, and interaction between subtasks are the keys to success. Prior work, however, may fail in 'one-to-many' or 'many-to-one' situations, or derive non-existent sentiment triplets, due to defective subtask formulations, sub-optimal feature representations, or the lack of subtask interaction. In this paper, we divide ASTE into target-opinion joint detection and sentiment classification subtasks, which is in line with human cognition, and correspondingly propose a sequence encoder and a table encoder. The table encoder extracts sentiment at the token-pair level, so that the compositional features between targets and opinions can be easily captured. To establish explicit interaction between subtasks, we utilize the table representation to guide the sequence encoding, and inject the sequence features back into the table encoder. Experiments show that our model outperforms state-of-the-art methods on six popular ASTE datasets.
【4】 Leveraging Language to Learn Program Abstractions and Search Heuristics
Authors: Catherine Wong, Kevin Ellis, Joshua B. Tenenbaum, Jacob Andreas
Affiliations: MIT; Cornell University; Center for Brains, Minds and Machines
Comments: Appeared in the Thirty-eighth International Conference on Machine Learning (ICML 2021)
Link: https://arxiv.org/abs/2106.11053
Abstract: Inductive program synthesis, or inferring programs from examples of desired behavior, offers a general paradigm for building interpretable, robust, and generalizable machine learning systems. Effective program synthesis depends on two key ingredients: a strong library of functions from which to build programs, and an efficient search strategy for finding programs that solve a given task. We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis. When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization on three domains -- string editing, image composition, and abstract reasoning about scenes -- even when no natural language hints are available at test time.
【5】 Interventional Video Grounding with Dual Contrastive Learning
Authors: Guoshun Nan, Rui Qiao, Yao Xiao, Jun Liu, Sicong Leng, Hao Zhang, Wei Lu
Affiliations: StatNLP Research Group, Singapore University of Technology and Design; Shanghai Jiao Tong University, China
Comments: Accepted at CVPR 2021
Link: https://arxiv.org/abs/2106.11013
Abstract: Video grounding aims to localize a moment in an untrimmed video for a given textual query. Existing approaches focus on the alignment of visual and language stimuli with various likelihood-based matching or regression strategies, i.e., P(Y|X). Consequently, these models may suffer from spurious correlations between the language and video features due to the selection bias of the dataset. (1) To uncover the causality behind the model and data, we first propose a novel paradigm from the perspective of causal inference: interventional video grounding (IVG), which leverages backdoor adjustment to deconfound the selection bias based on a structured causal model (SCM) and the do-calculus P(Y|do(X)). We then present a simple yet effective method to approximate the unobserved confounder, as it cannot be directly sampled from the dataset. (2) Meanwhile, we introduce a dual contrastive learning approach (DCL) to better align text and video by maximizing the mutual information (MI) between the query and video clips, and the MI between the start/end frames of a target moment and the other frames within a video, so as to learn more informative visual representations. Experiments on three standard benchmarks show the effectiveness of our approaches.
【6】 TCIC: Theme Concepts Learning Cross Language and Vision for Image Captioning
Authors: Zhihao Fan, Zhongyu Wei, Siyuan Wang, Ruize Wang, Zejun Li, Haijun Shan, Xuanjing Huang
Affiliations: Zhejiang Lab; Research Institute of Intelligent and Complex Systems, Fudan University, China
Comments: IJCAI 2021
Link: https://arxiv.org/abs/2106.10936
Abstract: Existing research on image captioning usually represents an image using a scene graph with low-level facts (objects and relations) and fails to capture the high-level semantics. In this paper, we propose a Theme Concepts extended Image Captioning (TCIC) framework that incorporates theme concepts to represent high-level cross-modality semantics. In practice, we model theme concepts as memory vectors and propose Transformer with Theme Nodes (TTN) to incorporate those vectors for image captioning. Considering that theme concepts can be learned from both images and captions, we propose two settings for their representation learning based on TTN. On the vision side, TTN is configured to take both scene-graph-based features and theme concepts as input for visual representation learning. On the language side, TTN is configured to take both captions and theme concepts as input for text representation reconstruction. Both settings aim to generate target captions with the same transformer-based decoder. During training, we further align the representations of theme concepts learned from images and the corresponding captions to enforce cross-modality learning. Experimental results on MS COCO show the effectiveness of our approach compared to some state-of-the-art models.
【7】 Context-Aware Legal Citation Recommendation using Deep Learning
Authors: Zihan Huang, Charles Low, Mengqiu Teng, Hongyi Zhang, Daniel E. Ho, Mark S. Krass, Matthias Grabmair
Affiliations: Language Technologies Institute, Carnegie Mellon University; Stanford University; Department of Informatics, Technical University of Munich; SINC GmbH
Comments: 10 pages, published in Proceedings of ICAIL 2021; link to data here: this https URL; code available here: this https URL
Link: https://arxiv.org/abs/2106.10776
Abstract: Lawyers and judges spend a large amount of time researching the proper legal authority to cite while drafting decisions. In this paper, we develop a citation recommendation tool that can help improve efficiency in the process of opinion drafting. We train four types of machine learning models, including a citation-list based method (collaborative filtering) and three context-based methods (text similarity, BiLSTM, and RoBERTa classifiers). Our experiments show that leveraging local textual context improves recommendation, and that deep neural models achieve decent performance. We show that non-deep text-based methods benefit from access to structured case metadata, but deep models only benefit from such access when predicting from contexts of insufficient length. We also find that, even after extensive training, RoBERTa does not outperform a recurrent neural model, despite its benefit of pretraining. Our behavior analysis of the RoBERTa model further shows that predictive performance is stable across time and citation classes.
【8】 CPM-2: Large-scale Cost-effective Pre-trained Language Models
Authors: Zhengyan Zhang, Yuxian Gu, Xu Han, Shengqi Chen, Chaojun Xiao, Zhenbo Sun, Yuan Yao, Fanchao Qi, Jian Guan, Pei Ke, Yanzheng Cai, Guoyang Zeng, Zhixing Tan, Zhiyuan Liu, Minlie Huang, Wentao Han, Yang Liu, Xiaoyan Zhu, Maosong Sun
Affiliations: Department of Computer Science and Technology, Tsinghua University & BAAI
Link: https://arxiv.org/abs/2106.10715
Abstract: In recent years, the size of pre-trained language models (PLMs) has grown by leaps and bounds. However, efficiency issues of these large-scale PLMs limit their utilization in real-world scenarios. We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference. (1) We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch. (2) We explore the best practice of prompt tuning with large-scale PLMs. Compared with conventional fine-tuning, prompt tuning significantly reduces the number of task-specific parameters. (3) We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources. Based on our cost-effective pipeline, we pre-train two models: an encoder-decoder bilingual model with 11 billion parameters (CPM-2) and its corresponding MoE version with 198 billion parameters. In our experiments, we compare CPM-2 with mT5 on downstream tasks. Experimental results show that CPM-2 has excellent general language intelligence. Moreover, we validate the efficiency of InfMoE when conducting inference with large-scale models having tens of billions of parameters on a single GPU. All source code and model parameters are available at https://github.com/TsinghuaAI/CPM.
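As a concrete illustration of the prompt-tuning idea the abstract contrasts with fine-tuning (a generic sketch, not CPM-2's implementation): freeze all model weights and learn only a short sequence of soft prompt vectors prepended to the input embeddings, so the task-specific parameter count drops from billions to a few thousand.

```python
# Soft prompt tuning: only the prepended prompt vectors are trainable.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable prompt vectors prepended to frozen-model input embeddings."""
    def __init__(self, prompt_len, dim):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

    def forward(self, input_embs):                 # (batch, seq, dim)
        p = self.prompt.unsqueeze(0).expand(input_embs.size(0), -1, -1)
        return torch.cat([p, input_embs], dim=1)   # (batch, prompt_len + seq, dim)

soft_prompt = SoftPrompt(prompt_len=20, dim=1024)
x = soft_prompt(torch.randn(2, 16, 1024))
# Training would freeze the PLM and optimize only soft_prompt.parameters():
#   for p in plm.parameters(): p.requires_grad = False
#   optim = torch.optim.Adam(soft_prompt.parameters(), lr=1e-3)
```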
【9】 Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets
Authors: Irene Solaiman, Christy Dennison
Affiliations: OpenAI
Comments: Both authors contributed equally. Submitted to NeurIPS 2021
Link: https://arxiv.org/abs/2106.10328
Abstract: Language models can generate harmful and biased outputs and exhibit undesirable behavior. We propose a Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets, an iterative process to significantly change model behavior by crafting and fine-tuning on a dataset that reflects a predetermined set of target values. We evaluate our process using three metrics: quantitative metrics with human evaluations that score output adherence to a target value; toxicity scoring on outputs; and qualitative metrics analyzing the most common words associated with a given social category. Through each iteration, we add additional training dataset examples based on shortcomings observed in the evaluations. PALMS performs significantly better on all metrics compared to baseline and control models, for a broad range of GPT-3 language model sizes, without compromising capability integrity. We find that the effectiveness of PALMS increases with model size. We show that significantly adjusting language model behavior is feasible with a small, hand-curated dataset.
Others (7 papers)
【1】 Abstract Geometrical Computation 11: Slanted Firing Squad Synchronisation on Signal Machines
Authors: Jérôme Durand-Lose, Aurélien Emmanuel
Affiliations: Université d'Orléans, INSA Centre Val de Loire, LIFO EA, FR-, Orléans, France
Comments: 21 pages, 29 figures
Link: https://arxiv.org/abs/2106.11176
Abstract: Firing Squad Synchronisation on Cellular Automata is the dynamical synchronisation of finitely many cells without any prior knowledge of their range. This can be conceived as a signal with an infinite speed. Most of the proposed constructions naturally translate to the continuous setting of signal machines and generate fractal figures with an accumulation on a horizontal line, i.e. synchronously, in the space-time diagram. Signal machines are studied in a series of articles named Abstract Geometrical Computation. In the present article, we design a signal machine that is able to synchronise/accumulate on any non-infinite slope. The slope is encoded in the initial configuration. This is done by constructing an infinite tree such that each node computes the way the tree expands. The interest of Abstract Geometrical Computation is to do away with the constraint of discrete space while tackling new difficulties arising from continuous space. The purpose of this paper in particular is to provide basic tools for the further study of computable accumulation lines in the signal-machine model.
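For readers new to the model: a signal machine evolves finitely many dimensionless signals moving at constant speeds on the real line, rewriting them when they collide. The toy event-driven simulator below conveys the model only; it is an illustrative sketch under an invented rule encoding, not the synchronisation construction of the paper.

```python
from itertools import combinations

# A signal is (name, position, speed). `rules` maps a frozenset of colliding
# signal names to the (name, speed) pairs of the signals emitted at the
# collision point; colliding signals with no matching rule simply vanish.
def run(signals, rules, t_max):
    t = 0.0
    while True:
        best = None
        for (n1, x1, v1), (n2, x2, v2) in combinations(signals, 2):
            if v1 == v2:
                continue                       # parallel signals never meet
            dt = (x2 - x1) / (v1 - v2)
            if dt > 1e-12 and (best is None or dt < best[0]):
                best = (dt, frozenset({n1, n2}), x1 + v1 * dt)
        if best is None or t + best[0] > t_max:
            return signals
        dt, names, x = best
        t += dt
        # Advance every signal to the collision time, ...
        signals = [(n, px + pv * dt, pv) for (n, px, pv) in signals]
        # ... remove the colliding ones, and emit the rule's output signals.
        signals = [s for s in signals if not (s[0] in names and abs(s[1] - x) < 1e-9)]
        signals += [(n, x, v) for (n, v) in rules.get(names, [])]
        print(f"t={t:.3f}: collision of {set(names)} at x={x:.3f}")
```

For instance, run([('left', 0.0, 1.0), ('right', 1.0, -1.0)], {frozenset({'left', 'right'}): [('sync', 0.0)]}, 10.0) collides the two signals at x=0.5 and leaves a stationary sync signal there; the paper's constructions arrange cascades of such collisions whose accumulation points draw the desired slanted synchronisation line.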
【2】 QuaPy: A Python-Based Framework for Quantification
Authors: Alejandro Moreo, Andrea Esuli, Fabrizio Sebastiani
Affiliations: Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Via Giuseppe Moruzzi, Pisa, Italy
Link: https://arxiv.org/abs/2106.11057
Abstract: QuaPy is an open-source framework for performing quantification (a.k.a. supervised prevalence estimation), written in Python. Quantification is the task of training quantifiers via supervised learning, where a quantifier is a predictor that estimates the relative frequencies (a.k.a. prevalence values) of the classes of interest in a sample of unlabelled data. While quantification can be trivially performed by applying a standard classifier to each unlabelled data item and counting how many data items have been assigned to each class, it has been shown that this "classify and count" method is outperformed by methods specifically designed for quantification. QuaPy provides implementations of a number of baseline methods and advanced quantification methods, of routines for quantification-oriented model selection, of several broadly accepted evaluation measures, and of robust evaluation protocols routinely used in the field. QuaPy also makes available datasets commonly used for testing quantifiers, and offers visualization tools for facilitating the analysis and interpretation of results. The software is open-source and publicly available under a BSD-3 licence at https://github.com/HLT-ISTI/QuaPy and can be installed via pip (https://pypi.org/project/QuaPy/).
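The "classify and count" baseline the abstract mentions, and the simplest correction that outperforms it, fit in a few lines. The sketch below illustrates the quantification task itself; it is written from scratch for the binary case and is not QuaPy's own API.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def cc_prevalence(clf, X):
    """'Classify and count': fraction of items the classifier labels positive."""
    return clf.predict(X).mean()

def acc_prevalence(clf, X, X_val, y_val):
    """Adjusted classify-and-count: correct CC using the classifier's
    true/false positive rates estimated on held-out validation data."""
    tn, fp, fn, tp = confusion_matrix(y_val, clf.predict(X_val)).ravel()
    tpr, fpr = tp / (tp + fn), fp / (fp + tn)
    # CC's expectation is tpr*p + fpr*(1-p); solve for the true prevalence p.
    p = (cc_prevalence(clf, X) - fpr) / (tpr - fpr)
    return float(np.clip(p, 0.0, 1.0))

# e.g. clf = LogisticRegression().fit(X_train, y_train), then
# acc_prevalence(clf, X_unlabelled, X_val, y_val)
```

QuaPy implements these baselines along with stronger estimators behind a common interface; see the linked repository for the actual API.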
【3】 Conversational Agents in Software Engineering: Survey, Taxonomy and Challenges
Authors: Quim Motger, Xavier Franch, Jordi Marco
Affiliations: (ESSI), Universitat Politècnica de Catalunya (UPC), Spain
Comments: 37 pages, 15 figures, 2 tables, submitted to journal
Link: https://arxiv.org/abs/2106.10901
Abstract: The use of natural language interfaces in the field of human-computer interaction is undergoing intense study through dedicated scientific and industrial research. The latest contributions in the field, including deep learning approaches like recurrent neural networks, the potential of context-aware strategies, and user-centred design approaches, have brought the attention of the community back to software-based dialogue systems, generally known as conversational agents or chatbots. Nonetheless, given the novelty of the field, a generic, context-independent overview of the current state of research on conversational agents covering all the research perspectives involved is missing. Motivated by this context, this paper reports a survey of the current state of research on conversational agents through a systematic literature review of secondary studies. The conducted research is designed to develop an exhaustive perspective through a clear presentation of the aggregated knowledge published by recent literature within a variety of domains, research focuses, and contexts. As a result, this research proposes a holistic taxonomy of the different dimensions involved in the conversational agents field, which is expected to help researchers and to lay the groundwork for future research on natural language interfaces.
【4】 Pay Better Attention to Attention: Head Selection in Multilingual and Multi-Domain Sequence Modeling
Authors: Hongyu Gong, Yun Tang, Juan Pino, Xian Li
Affiliations: Facebook AI Research
Link: https://arxiv.org/abs/2106.10840
Abstract: Multi-head attention has each of the attention heads collect salient information from different parts of an input sequence, making it a powerful mechanism for sequence modeling. Multilingual and multi-domain learning are common scenarios for sequence modeling, where the key challenge is to maximize positive transfer and mitigate negative transfer across languages and domains. In this paper, we find that non-selective attention sharing is sub-optimal for achieving good generalization across all languages and domains. We further propose attention-sharing strategies to facilitate parameter sharing and specialization in multilingual and multi-domain sequence modeling. Our approach automatically learns shared and specialized attention heads for different languages and domains to mitigate their interference. Evaluated on various tasks including speech recognition, text-to-text translation, and speech-to-text translation, the proposed attention-sharing strategies consistently bring gains to sequence models built upon multi-head attention. For speech-to-text translation, our approach yields an average of +2.0 BLEU over 13 language directions in the multilingual setting and +2.0 BLEU over 3 domains in the multi-domain setting.
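One simple way to realize "shared and specialized attention heads" is a learned per-domain gate on each head's output, letting each language or domain down-weight heads that interfere with it. The self-contained PyTorch sketch below illustrates head selection in general; it is not the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMultiHeadAttention(nn.Module):
    """Self-attention whose heads are gated per language/domain (a sketch of
    the head-selection idea, not the paper's method)."""
    def __init__(self, d_model, n_heads, n_domains):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.dk = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One gate logit per (domain, head), learned jointly with the model.
        self.gates = nn.Parameter(torch.zeros(n_domains, n_heads))

    def forward(self, x, domain_id):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.h, self.dk).transpose(1, 2) for t in (q, k, v))
        att = F.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)
        heads = att @ v                              # (B, h, T, dk)
        g = torch.sigmoid(self.gates[domain_id])     # (h,) gates in (0, 1)
        heads = heads * g.view(1, self.h, 1, 1)      # suppress unhelpful heads
        return self.out(heads.transpose(1, 2).reshape(B, T, -1))

# e.g. mha = GatedMultiHeadAttention(512, 8, n_domains=3)
#      y = mha(torch.randn(2, 7, 512), domain_id=1)
```

Heads whose gate saturates near 1 for every domain behave as shared heads; heads gated on for only one domain become specialized to it.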
【5】 Calliar: An Online Handwritten Dataset for Arabic Calligraphy
Authors: Zaid Alyafeai, Maged S. Al-shaibani, Mustafa Ghaleb, Yousif Ahmed Al-Wajih
Affiliations: Department of Computer Science, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia; Interdisciplinary Research Center for Intelligent Secure Systems; Department of Systems Engineering
Link: https://arxiv.org/abs/2106.10745
Abstract: Calligraphy is an essential part of the Arabic heritage and culture. It has been used in the past for the decoration of houses and mosques. Usually, such calligraphy is designed manually by experts with aesthetic insights. In the past few years, there has been a considerable effort to digitize this type of art, either by taking photos of decorated buildings or by drawing the designs using digital devices. The latter is considered an online form, where the drawing is tracked by recording the movement of the apparatus, an electronic pen for instance, on a screen. In the literature, there are many offline datasets covering a diversity of Arabic calligraphy styles. However, no online dataset for Arabic calligraphy is available. In this paper, we describe our approach for the collection and annotation of an online dataset for Arabic calligraphy, called Calliar, which consists of 2,500 sentences. Calliar is annotated for stroke-, character-, word- and sentence-level prediction.
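Online handwriting datasets of this kind typically store each sample as time-ordered pen trajectories grouped into strokes, with nested annotation down to the character level. The record below is a hypothetical sketch of such a layout; the actual Calliar schema may differ, and the field names here are invented.

```python
import json

# Hypothetical stroke-level record for one annotated sample; each stroke is an
# ordered list of (x, y) pen positions captured from the drawing device.
record = {
    "sentence": "بسم الله",
    "words": [
        {"word": "بسم",
         "characters": [
             {"char": "ب", "strokes": [[[10, 12], [14, 15], [19, 16]]]},
         ]},
    ],
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```

The four annotation granularities in the paper (stroke, character, word, sentence) correspond to the nesting levels of such a structure.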
【6】 TweeNLP: A Twitter Exploration Portal for Natural Language Processing
Authors: Viraj Shah, Shruti Singh, Mayank Singh
Affiliations: Indian Institute of Technology Gandhinagar, Gujarat, India
Comments: ACL-IJCNLP Demo Track 2021
Link: https://arxiv.org/abs/2106.10512
Abstract: We present TweeNLP, a one-stop portal that organizes Twitter's natural language processing (NLP) data and builds a visualization and exploration platform. It curates 19,395 tweets (as of April 2021) from various NLP conferences and general NLP discussions. It supports multiple features, such as TweetExplorer for exploring tweets by topic, visualizing insights from Twitter activity throughout the organization cycle of conferences, and discovering popular research papers and researchers. It also builds a timeline of conference and workshop submission deadlines. We envision TweeNLP functioning as a collective memory unit for the NLP community by integrating the tweets pertaining to research papers with the NLPExplorer scientific literature search engine. The current system is hosted at http://nlpexplorer.org/twitter/CFP.
【7】 Non-native English lexicon creation for bilingual speech synthesis
Authors: Arun Baby, Pranav Jawale, Saranya Vinnaitherthan, Sumukh Badam, Nagaraj Adiga, Sharath Adavanne
Affiliations: Zapr Media Labs (Red Brick Lane Marketing Solutions Pvt. Ltd.), India
Comments: Accepted for presentation at the Speech Synthesis Workshop (SSW), August 2021
Link: https://arxiv.org/abs/2106.10870
Abstract: Bilingual English speakers speak English as one of their languages. Their English is of a non-native kind, and their conversations are code-mixed. The intelligibility of a bilingual text-to-speech (TTS) system for such non-native English speakers depends on a lexicon that captures the phoneme sequences used by non-native speakers. However, due to the lack of a non-native English lexicon, existing bilingual TTS systems employ widely available native English lexicons in addition to their native-language lexicon. Due to the inconsistency between the non-native English pronunciation in the audio and the native English lexicon in the text, the intelligibility of synthesized speech in such TTS systems is significantly reduced. This paper is motivated by the knowledge that the native language of the speaker highly influences non-native English pronunciation. We propose a generic approach to obtain rules based on letter-to-phoneme alignment that map a native English lexicon to its non-native version. The effectiveness of such mapping is studied by comparing bilingual (Indian English and Hindi) TTS systems trained with and without the proposed rules. Subjective evaluation shows that the bilingual TTS system trained with the proposed non-native English lexicon rules obtains a 6% absolute improvement in preference.
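The proposed mapping boils down to rewrite rules over phoneme sequences, derived from letter-to-phoneme alignments. Below is a toy sketch of applying such rules to a native-English pronunciation; the two rules are invented placeholders, not the rules learned in the paper.

```python
# Toy phoneme rewrite rules adapting a native English pronunciation toward a
# non-native (Hindi-influenced) variant. Both rules are invented placeholders;
# the paper derives its rules from letter-to-phoneme alignments.
RULES = [
    (("W",), ("V",)),     # hypothetical /w/ -> /v/ substitution
    (("TH",), ("T",)),    # hypothetical th-stopping
]

def adapt(pron):
    """Apply each source -> target phoneme rule, left to right."""
    phones = list(pron)
    for src, tgt in RULES:
        out, i = [], 0
        while i < len(phones):
            if tuple(phones[i:i + len(src)]) == src:
                out.extend(tgt)
                i += len(src)
            else:
                out.append(phones[i])
                i += 1
        phones = out
    return phones

print(adapt(["W", "ER", "L", "D"]))   # -> ['V', 'ER', 'L', 'D']
```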