cs.CL: 16 papers today
Transformer (2 papers)
【1】 Smart Bird: Learnable Sparse Attention for Efficient and Effective Transformer. Link: https://arxiv.org/abs/2108.09193
Authors: Chuhan Wu, Fangzhao Wu, Tao Qi, Yongfeng Huang
Affiliations: Department of Electronic Engineering & BNRist, Tsinghua University, Beijing, China; Microsoft Research Asia, Beijing, China
Abstract: Transformer has achieved great success in NLP. However, the quadratic complexity of the self-attention mechanism in Transformer makes it inefficient in handling long sequences. Many existing works explore to accelerate Transformers by computing sparse self-attention instead of a dense one, which usually attends to tokens at certain positions or randomly selected tokens. However, manually selected or random tokens may be uninformative for context modeling. In this paper, we propose Smart Bird, which is an efficient and effective Transformer with learnable sparse attention. In Smart Bird, we first compute a sketched attention matrix with a single-head low-dimensional Transformer, which aims to find potential important interactions between tokens. We then sample token pairs based on their probability scores derived from the sketched attention matrix to generate different sparse attention index matrices for different attention heads. Finally, we select token embeddings according to the index matrices to form the input of sparse attention networks. Extensive experiments on six benchmark datasets for different tasks validate the efficiency and effectiveness of Smart Bird in text modeling.
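To make the sampling step concrete, here is a minimal PyTorch sketch of the index-sampling idea the abstract describes: per-row probabilities are taken from a (hypothetical) sketched attention matrix, a separate set of indices is drawn for each head, and key embeddings are gathered accordingly. The function names, the row-wise softmax, and the fixed budget k are illustrative assumptions, not the authors' implementation.

```python
import torch

def sample_sparse_indices(sketched_scores, num_heads, k):
    """Sample per-head sparse attention index matrices from a sketched
    attention matrix (sketch based on the abstract, not the paper's code).

    sketched_scores: (L, L) scores from the low-dimensional single-head
                     "sketch" Transformer (assumed given here).
    Returns: (num_heads, L, k) column indices each query attends to.
    """
    probs = torch.softmax(sketched_scores, dim=-1)  # per-row sampling distribution
    # One independent draw per head so heads get different sparse patterns.
    index_matrices = [
        torch.multinomial(probs, num_samples=k, replacement=False)  # (L, k)
        for _ in range(num_heads)
    ]
    return torch.stack(index_matrices)  # (num_heads, L, k)

def gather_key_embeddings(token_emb, index_matrix):
    """Select the token embeddings each query position will attend to.
    token_emb: (L, D); index_matrix: (L, k) -> (L, k, D)."""
    return token_emb[index_matrix]

# Toy usage with random inputs.
L, D = 16, 32
scores = torch.randn(L, L)
emb = torch.randn(L, D)
idx = sample_sparse_indices(scores, num_heads=4, k=4)
keys = gather_key_embeddings(emb, idx[0])
print(idx.shape, keys.shape)  # torch.Size([4, 16, 4]) torch.Size([16, 4, 32])
```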
【2】 Semantic Communication with Adaptive Universal Transformer. Link: https://arxiv.org/abs/2108.09119
Authors: Qingyang Zhou, Rongpeng Li, Zhifeng Zhao, Chenghui Peng, Honggang Zhang
Abstract: With the development of deep learning (DL), natural language processing (NLP) makes it possible for us to analyze and understand a large amount of language texts. Accordingly, we can achieve a semantic communication in terms of joint semantic source and channel coding over a noisy channel with the help of NLP. However, the existing method to realize this goal is to use a fixed transformer of NLP while ignoring the difference of semantic information contained in each sentence. To solve this problem, we propose a new semantic communication system based on Universal Transformer. Compared with the traditional transformer, an adaptive circulation mechanism is introduced in the Universal Transformer. Through the introduction of the circulation mechanism, the new semantic communication system can be more flexible to transmit sentences with different semantic information, and achieve better end-to-end performance under various channel conditions.
BERT (1 paper)
【1】 Extracting Radiological Findings With Normalized Anatomical Information Using a Span-Based BERT Relation Extraction Model. Link: https://arxiv.org/abs/2108.09211
Authors: Kevin Lybarger, Aashka Damani, Martin Gunn, Ozlem Uzuner, Meliha Yetisgen
Affiliations: University of Washington, Seattle, WA, USA; George Mason University, Fairfax, VA, USA
Abstract: Medical imaging is critical to the diagnosis and treatment of numerous medical problems, including many forms of cancer. Medical imaging reports distill the findings and observations of radiologists, creating an unstructured textual representation of unstructured medical images. Large-scale use of this text-encoded information requires converting the unstructured text to a structured, semantic representation. We explore the extraction and normalization of anatomical information in radiology reports that is associated with radiological findings. We investigate this extraction and normalization task using a span-based relation extraction model that jointly extracts entities and relations using BERT. This work examines the factors that influence extraction and normalization performance, including the body part/organ system, frequency of occurrence, span length, and span diversity. It discusses approaches for improving performance and creating high-quality semantic representations of radiological phenomena.
Graph | Knowledge Graph | Knowledge (3 papers)
【1】 SoMeSci - A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles. Link: https://arxiv.org/abs/2108.09070
Authors: David Schindler, Felix Bensmann, Stefan Dietze, Frank Krüger
Affiliations: Institute of Communications Engineering, University of Rostock, Germany; GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany; Heinrich-Heine-University, Düsseldorf, Germany
Note: Preprint of CIKM 2021 Resource Paper, 10 pages
Abstract: Knowledge about software used in scientific investigations is important for several reasons, for instance, to enable an understanding of provenance and methods involved in data handling. However, software is usually not formally cited, but rather mentioned informally within the scholarly description of the investigation, raising the need for automatic information extraction and disambiguation. Given the lack of reliable ground truth data, we present SoMeSci (Software Mentions in Science), a gold standard knowledge graph of software mentions in scientific articles. It contains high quality annotations (IRR: κ = .82) of 3756 software mentions in 1367 PubMed Central articles. Besides the plain mention of the software, we also provide relation labels for additional information, such as the version, the developer, a URL or citations. Moreover, we distinguish between different types, such as application, plugin or programming environment, as well as different types of mentions, such as usage or creation. To the best of our knowledge, SoMeSci is the most comprehensive corpus about software mentions in scientific articles, providing training samples for Named Entity Recognition, Relation Extraction, Entity Disambiguation, and Entity Linking. Finally, we sketch potential use cases and provide baseline results.
【2】 Twitter User Representation using Weakly Supervised Graph Embedding. Link: https://arxiv.org/abs/2108.08988
Authors: Tunazzina Islam, Dan Goldwasser
Affiliations: Department of Computer Science, Purdue University, West Lafayette, Indiana
Note: accepted at the 16th International AAAI Conference on Web and Social Media (ICWSM-2022), direct accept from May 2021 submission, 12 pages
Abstract: Social media platforms provide convenient means for users to participate in multiple online activities on various contents and create fast widespread interactions. However, this rapidly growing access has also increased the diverse information, and characterizing user types to understand people's lifestyle decisions shared in social media is challenging. In this paper, we propose a weakly supervised graph embedding based framework for understanding user types. We evaluate the user embedding learned using weak supervision over well-being related tweets from Twitter, focusing on 'Yoga', 'Keto diet'. Experiments on real-world datasets demonstrate that the proposed framework outperforms the baselines for detecting user types. Finally, we illustrate data analysis on different types of users (e.g., practitioner vs. promotional) from our dataset. While we focus on lifestyle-related tweets (i.e., yoga, keto), our method for constructing user representation readily generalizes to other domains.
【3】 SMedBERT: A Knowledge-Enhanced Pre-trained Language Model with Structured Semantics for Medical Text Mining. Link: https://arxiv.org/abs/2108.08983
Authors: Taolin Zhang, Zerui Cai, Chengyu Wang, Minghui Qiu, Bite Yang, Xiaofeng He
Affiliations: School of Software Engineering, East China Normal University; Alibaba Group; Shanghai Key Laboratory of Trustworthy Computing, School of Computer Science and Technology, East China Normal University; DXY
Note: ACL 2021
Abstract: Recently, the performance of Pre-trained Language Models (PLMs) has been significantly improved by injecting knowledge facts to enhance their abilities of language understanding. For medical domains, the background knowledge sources are especially useful, due to the massive medical terms and their complicated relations are difficult to understand in text. In this work, we introduce SMedBERT, a medical PLM trained on large-scale medical corpora, incorporating deep structured semantic knowledge from neighbors of linked entities. In SMedBERT, the mention-neighbor hybrid attention is proposed to learn heterogeneous-entity information, which infuses the semantic representations of entity types into the homogeneous neighboring entity structure. Apart from knowledge integration as external features, we propose to employ the neighbors of linked-entities in the knowledge graph as additional global contexts of text mentions, allowing them to communicate via shared neighbors, thus enrich their semantic representations. Experiments demonstrate that SMedBERT significantly outperforms strong baselines in various knowledge-intensive Chinese medical tasks. It also improves the performance of other tasks such as question answering, question matching and natural language inference.
GAN | Adversarial | Attack | Generation (2 papers)
【1】 A Neural Conversation Generation Model via Equivalent Shared Memory Investigation. Link: https://arxiv.org/abs/2108.09164
Authors: Changzhen Ji, Yating Zhang, Xiaozhong Liu, Adam Jatowt, Changlong Sun, Conghui Zhu, Tiejun Zhao
Affiliations: Harbin Institute of Technology, Harbin, China; Alibaba Group, Hangzhou, China; Worcester Polytechnic Institute, Worcester, Massachusetts, USA; University of Innsbruck, Innsbruck, Austria
Abstract: Conversation generation as a challenging task in Natural Language Generation (NLG) has been increasingly attracting attention over the last years. A number of recent works adopted sequence-to-sequence structures along with external knowledge, which successfully enhanced the quality of generated conversations. Nevertheless, few works utilized the knowledge extracted from similar conversations for utterance generation. Taking conversations in customer service and court debate domains as examples, it is evident that essential entities/phrases, as well as their associated logic and inter-relationships can be extracted and borrowed from similar conversation instances. Such information could provide useful signals for improving conversation generation. In this paper, we propose a novel reading and memory framework called Deep Reading Memory Network (DRMN) which is capable of remembering useful information of similar conversations for improving utterance generation. We apply our model to two large-scale conversation datasets of justice and e-commerce fields. Experiments prove that the proposed model outperforms the state-of-the-art approaches.
【2】 CIGLI: Conditional Image Generation from Language & Image. Link: https://arxiv.org/abs/2108.08955
Authors: Xiaopeng Lu, Lynnette Ng, Jared Fernandez, Hao Zhu
Affiliations: Carnegie Mellon University
Note: 5 pages
Abstract: Multi-modal generation has been widely explored in recent years. Current research directions involve generating text based on an image or vice versa. In this paper, we propose a new task called CIGLI: Conditional Image Generation from Language and Image. Instead of generating an image based on text as in text-image generation, this task requires the generation of an image from a textual description and an image prompt. We designed a new dataset to ensure that the text description describes information from both images, and that solely analyzing the description is insufficient to generate an image. We then propose a novel language-image fusion model which improves the performance over two established baseline methods, as evaluated by quantitative (automatic) and qualitative (human) evaluations. The code and dataset are available at https://github.com/vincentlux/CIGLI.
Word2Vec | Text | Words (3 papers)
【1】 Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling. Link: https://arxiv.org/abs/2108.08965
Authors: Xiaopeng Lu, Zhen Fan, Yansen Wang, Jean Oh, Carolyn P. Rose
Affiliations: Language Technologies Institute, Carnegie Mellon University, Forbes Avenue, Pittsburgh, PA
Note: 9 pages
Abstract: As an important task in multimodal context understanding, Text-VQA (Visual Question Answering) aims at question answering through reading text information in images. It differentiates from the original VQA task as Text-VQA requires large amounts of scene-text relationship understanding, in addition to the cross-modal grounding capability. In this paper, we propose Localize, Group, and Select (LOGOS), a novel model which attempts to tackle this problem from multiple aspects. LOGOS leverages two grounding tasks to better localize the key information of the image, utilizes scene text clustering to group individual OCR tokens, and learns to select the best answer from different sources of OCR (Optical Character Recognition) texts. Experiments show that LOGOS outperforms previous state-of-the-art methods on two Text-VQA benchmarks without using additional OCR annotation data. Ablation studies and analysis demonstrate the capability of LOGOS to bridge different modalities and better understand scene text.
【2】 A Framework for Neural Topic Modeling of Text Corpora. Link: https://arxiv.org/abs/2108.08946
Authors: Shayan Fazeli, Majid Sarrafzadeh
Abstract: Topic Modeling refers to the problem of discovering the main topics that have occurred in corpora of textual data, with solutions finding crucial applications in numerous fields. In this work, inspired by the recent advancements in the Natural Language Processing domain, we introduce FAME, an open-source framework enabling an efficient mechanism of extracting and incorporating textual features and utilizing them in discovering topics and clustering text documents that are semantically similar in a corpus. These features range from traditional approaches (e.g., frequency-based) to the most recent auto-encoding embeddings from transformer-based language models such as BERT model family. To demonstrate the effectiveness of this library, we conducted experiments on the well-known News-Group dataset. The library is available online.
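As a rough illustration of the pipeline the abstract describes (extract textual features, then cluster semantically similar documents into topics), here is a small sketch using frequency-based features with scikit-learn. It is not the FAME library's API; the toy documents, cluster count, and top-term readout are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the transformer attention mechanism for language models",
    "sparse attention makes transformers efficient on long text",
    "clustering news articles by topic with embeddings",
    "topic models discover themes in large text corpora",
]

# Frequency-based features; a transformer-embedding variant would swap this step.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Cluster semantically similar documents; each cluster acts as one topic.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Describe each topic by the highest-weight terms at its centroid.
terms = np.array(vectorizer.get_feature_names_out())
for c, center in enumerate(km.cluster_centers_):
    top = terms[np.argsort(center)[::-1][:5]]
    print(f"topic {c}: {', '.join(top)}")
```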
【3】 Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. Link: https://arxiv.org/abs/2108.08877
Authors: Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B. Hall, Daniel Cer, Yinfei Yang
Affiliations: Google Research, Mountain View, CA
Abstract: We provide the first exploration of text-to-text transformers (T5) sentence embeddings. Sentence embeddings are broadly useful for language processing tasks. While T5 achieves impressive performance on language tasks cast as sequence-to-sequence mapping problems, it is unclear how to produce sentence embeddings from encoder-decoder models. We investigate three methods for extracting T5 sentence embeddings: two utilize only the T5 encoder and one uses the full T5 encoder-decoder model. Our encoder-only models outperform BERT-based sentence embeddings on both transfer tasks and semantic textual similarity (STS). Our encoder-decoder method achieves further improvement on STS. Scaling up T5 from millions to billions of parameters is found to produce consistent improvements on downstream tasks. Finally, we introduce a two-stage contrastive learning approach that achieves a new state-of-art on STS using sentence embeddings, outperforming both Sentence BERT and SimCSE.
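For readers who want to try encoder-only T5 sentence embeddings in the spirit of this work, the sketch below mean-pools T5 encoder states with Hugging Face transformers. The t5-small checkpoint, the mean pooling, and the cosine-similarity demo are illustrative assumptions; the paper's exact pooling strategies and trained Sentence-T5 models are not reproduced here.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5EncoderModel.from_pretrained("t5-small")

def encode(sentences):
    """Mean-pool T5 encoder states into fixed-size sentence embeddings."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state         # (B, L, D)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, D)

emb = encode(["A cat sits on the mat.", "A kitten rests on a rug."])
sim = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(float(sim))
```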
Other Neural Networks | Deep Learning | Models | Modeling (1 paper)
【1】 Open Relation Modeling: Learning to Define Relations between Entities. Link: https://arxiv.org/abs/2108.09241
Authors: Jie Huang, Kevin Chen-Chuan Chang, Jinjun Xiong, Wen-mei Hwu
Affiliations: University of Illinois at Urbana-Champaign, USA; University at Buffalo
Abstract: Relations between entities can be represented by different instances, e.g., a sentence containing both entities or a fact in a Knowledge Graph (KG). However, these instances may not well capture the general relations between entities, may be difficult to understand by humans, even may not be found due to the incompleteness of the knowledge source. In this paper, we introduce the Open Relation Modeling task - given two entities, generate a coherent sentence describing the relation between them. To solve this task, we propose to teach machines to generate definition-like relation descriptions by letting them learn from definitions of entities. Specifically, we fine-tune Pre-trained Language Models (PLMs) to produce definitions conditioned on extracted entity pairs. To help PLMs reason between entities and provide additional relational knowledge to PLMs for open relation modeling, we incorporate reasoning paths in KGs and include a reasoning path selection mechanism. We show that PLMs can select interpretable and informative reasoning paths by confidence estimation, and the selected path can guide PLMs to generate better relation descriptions. Experimental results show that our model can generate concise but informative relation descriptions that capture the representative characteristics of entities and relations.
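A minimal sketch of how an entity pair plus one KG reasoning path might be serialized into a text-to-text prompt for a fine-tuned PLM, as the abstract outlines. The prompt template, the path format, and the t5-small checkpoint are assumptions for illustration; an untuned checkpoint will not produce meaningful definitions, and the paper's confidence-based path selection is not shown.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
# Stand-in for a PLM fine-tuned on entity definitions; raw t5-small output is only illustrative.
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def describe_relation(head, tail, reasoning_path):
    """Serialize an entity pair plus one KG reasoning path into a
    text-to-text prompt and generate a definition-like description."""
    path_text = " ; ".join(f"{h} -{r}-> {t}" for h, r, t in reasoning_path)
    prompt = f"describe relation: {head} | {tail} | path: {path_text}"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

path = [("Paris", "capital_of", "France"), ("France", "located_in", "Europe")]
print(describe_relation("Paris", "Europe", path))
```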
Others (4 papers)
【1】 Group-based Distinctive Image Captioning with Memory Attention. Link: https://arxiv.org/abs/2108.09151
Authors: Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan
Affiliations: Department of Computer Science, City University of Hong Kong; Aerospace Information Research Institute, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Baidu Research
Note: Accepted at ACM MM 2021 (oral)
Abstract: Describing images using natural language is widely known as image captioning, which has made consistent progress due to the development of computer vision and natural language generation techniques. Though conventional captioning models achieve high accuracy based on popular metrics, i.e., BLEU, CIDEr, and SPICE, the ability of captions to distinguish the target image from other similar images is under-explored. To generate distinctive captions, a few pioneers employ contrastive learning or re-weighted the ground-truth captions, which focuses on one single input image. However, the relationships between objects in a similar image group (e.g., items or properties within the same album or fine-grained events) are neglected. In this paper, we improve the distinctiveness of image captions using a Group-based Distinctive Captioning Model (GdisCap), which compares each image with other images in one similar group and highlights the uniqueness of each image. In particular, we propose a group-based memory attention (GMA) module, which stores object features that are unique among the image group (i.e., with low similarity to objects in other images). These unique object features are highlighted when generating captions, resulting in more distinctive captions. Furthermore, the distinctive words in the ground-truth captions are selected to supervise the language decoder and GMA. Finally, we propose a new evaluation metric, distinctive word rate (DisWordRate), to measure the distinctiveness of captions. Quantitative results indicate that the proposed method significantly improves the distinctiveness of several baseline models, and achieves the state-of-the-art performance on both accuracy and distinctiveness. Results of a user study agree with the quantitative evaluation and demonstrate the rationality of the new metric DisWordRate.
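To illustrate the notion of "unique among the image group" that drives the memory module, here is a small sketch that scores each object feature of the target image by one minus its highest cosine similarity to objects from the other images in the group. The scoring formula is an assumption for illustration only, not the GMA module itself.

```python
import torch
import torch.nn.functional as F

def uniqueness_weights(obj_feats, other_feats):
    """Score how distinctive each target-image object is within a similar-image
    group: 1 minus its highest cosine similarity to any object pooled from the
    other images (hypothetical scoring, loosely following the abstract).

    obj_feats:   (N, D) object features of the target image
    other_feats: (M, D) object features pooled from the rest of the group
    """
    a = F.normalize(obj_feats, dim=-1)    # (N, D)
    b = F.normalize(other_feats, dim=-1)  # (M, D)
    sim = a @ b.t()                       # (N, M) cosine similarities
    return 1.0 - sim.max(dim=1).values    # (N,), higher = more distinctive

w = uniqueness_weights(torch.randn(5, 256), torch.randn(20, 256))
print(w.shape)  # torch.Size([5])
```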
【2】 Airbert: In-domain Pretraining for Vision-and-Language Navigation. Link: https://arxiv.org/abs/2108.09105
Authors: Pierre-Louis Guhur, Makarand Tapaswi, Shizhe Chen, Ivan Laptev, Cordelia Schmid
Affiliations: Inria, École normale supérieure, CNRS, PSL Research University, Paris, France; IIIT Hyderabad, India
Note: To be published at ICCV 2021. Webpage is at this https URL linking to our dataset, code and models
Abstract: Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization, however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and results in limited improvements. In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings from online rental marketplaces. Using IC pairs we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of temporal order inside PI pairs. We use BnB to pretrain our Airbert model that can be adapted to discriminative and generative settings and show that it outperforms state of the art for Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining significantly increases performance on a challenging few-shot VLN evaluation, where we train the model only on VLN instructions from a few houses.
【3】 GEDIT: Geographic-Enhanced and Dependency-Guided Tagging for Joint POI and Accessibility Extraction at Baidu Maps. Link: https://arxiv.org/abs/2108.09104
Authors: Yibo Sun, Jizhou Huang, Chunyuan Yuan, Miao Fan, Haifeng Wang, Ming Liu, Bing Qin
Affiliations: Baidu Inc., Beijing, China; Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China
Note: Accepted by CIKM'21
Abstract: Providing timely accessibility reminders of a point-of-interest (POI) plays a vital role in improving user satisfaction of finding places and making visiting decisions. However, it is difficult to keep the POI database in sync with the real-world counterparts due to the dynamic nature of business changes. To alleviate this problem, we formulate and present a practical solution that jointly extracts POI mentions and identifies their coupled accessibility labels from unstructured text. We approach this task as a sequence tagging problem, where the goal is to produce (POI name, accessibility label) pairs from unstructured text. This task is challenging because of two main issues: (1) POI names are often newly-coined words so as to successfully register new entities or brands and (2) there may exist multiple such pairs in the text, which necessitates dealing with one-to-many or many-to-one mapping to make each POI coupled with its accessibility label. To this end, we propose a Geographic-Enhanced and Dependency-guIded sequence Tagging (GEDIT) model to concurrently address the two challenges. First, to alleviate challenge #1, we develop a geographic-enhanced pre-trained model to learn the text representations. Second, to mitigate challenge #2, we apply a relational graph convolutional network to learn the tree node representations from the parsed dependency tree. Finally, we construct a neural sequence tagging model by integrating and feeding the previously pre-learned representations into a CRF layer. Extensive experiments conducted on a real-world dataset demonstrate the superiority and effectiveness of GEDIT. In addition, it has already been deployed in production at Baidu Maps. Statistics show that the proposed solution can save significant human effort and labor costs to deal with the same amount of documents, which confirms that it is a practical way for POI accessibility maintenance.
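As an illustration of the sequence-tagging output format, the sketch below decodes (POI mention, accessibility label) pairs from a BIO-style tag sequence in which each POI tag carries its coupled label. The tag scheme, the example sentence, and the labels are assumptions; the paper's CRF, geographic-enhanced encoder, and R-GCN components are not shown.

```python
def decode_pairs(tokens, tags):
    """Decode (POI mention, accessibility label) pairs from BIO-style tags
    such as B-POI_closed / I-POI_closed (scheme assumed for illustration)."""
    pairs, mention, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if mention:  # flush a mention that ended right before this one
                pairs.append((" ".join(mention), label))
            mention, label = [token], tag[2:].split("_", 1)[1]
        elif tag.startswith("I-") and mention:
            mention.append(token)
        else:
            if mention:  # an O tag closes the current mention
                pairs.append((" ".join(mention), label))
            mention, label = [], None
    if mention:
        pairs.append((" ".join(mention), label))
    return pairs

tokens = ["Joe", "Coffee", "is", "temporarily", "closed"]
tags = ["B-POI_closed", "I-POI_closed", "O", "O", "O"]
print(decode_pairs(tokens, tags))  # [('Joe Coffee', 'closed')]
```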
【4】 Fastformer: Additive Attention is All You Need. Link: https://arxiv.org/abs/2108.09084
Authors: Chuhan Wu, Fangzhao Wu, Tao Qi, Yongfeng Huang
Affiliations: Department of Electronic Engineering & BNRist, Tsinghua University, Beijing, China; Microsoft Research Asia, Beijing, China
Abstract: Transformer is a powerful model for text understanding. However, it is inefficient due to its quadratic complexity to input sequence length. Although there are many methods on Transformer acceleration, they are still either inefficient on long sequences or not effective enough. In this paper, we propose Fastformer, which is an efficient Transformer model based on additive attention. In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use additive attention mechanism to model global contexts, and then further transform each token representation based on its interaction with global context representations. In this way, Fastformer can achieve effective context modeling with linear complexity. Extensive experiments on five datasets show that Fastformer is much more efficient than many existing Transformer models and can meanwhile achieve comparable or even better long text modeling performance.
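A much-simplified PyTorch sketch of the additive-attention idea stated in the abstract: a global context vector is pooled from the sequence with additive attention in linear time and then mixed back into every token by element-wise interaction. This is only a simplification of the mechanism described here, not the full Fastformer layer; the layer sizes and the residual connection are illustrative choices.

```python
import torch
import torch.nn as nn

class AdditiveGlobalContext(nn.Module):
    """Minimal sketch: additive-attention pooling into a global context
    vector, then token-wise interaction with that context (linear in L)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)        # additive attention scores
        self.transform = nn.Linear(dim, dim)  # token-wise re-transformation

    def forward(self, x, mask=None):          # x: (B, L, D)
        scores = self.score(x).squeeze(-1)    # (B, L), computed in linear time
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        alpha = torch.softmax(scores, dim=-1).unsqueeze(-1)  # (B, L, 1)
        global_ctx = (alpha * x).sum(dim=1, keepdim=True)    # (B, 1, D)
        # Element-wise interaction between each token and the global context.
        return self.transform(x * global_ctx) + x

layer = AdditiveGlobalContext(dim=64)
out = layer(torch.randn(2, 128, 64))
print(out.shape)  # torch.Size([2, 128, 64])
```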