cs.CL: 21 papers today
QA|VQA|Question Answering|Dialogue (1 paper)
【1】 MeDiaQA: A Question Answering Dataset on Medical Dialogues Link: https://arxiv.org/abs/2108.08074
Authors: Huqun Suri, Qi Zhang, Wenhua Huo, Yan Liu, Chunsheng Guan. Affiliation: Institute of Science and Technology, Taikang Insurance Group. Abstract: In this paper, we introduce MeDiaQA, a novel question answering (QA) dataset constructed on real online medical dialogues. It contains 22k human-annotated multiple-choice questions for over 11k dialogues with 120k utterances between patients and doctors, covering 150 disease specialties, collected from haodf.com and dxy.com. MeDiaQA is the first QA dataset that requires reasoning over medical dialogues, especially their quantitative contents. The dataset can test the computing, reasoning and understanding abilities of models across multi-turn dialogues, which is challenging compared with existing datasets. To address these challenges, we design MeDia-BERT, which achieves 64.3% accuracy, while human performance reaches 93% accuracy, indicating that there is still large room for improvement.
Semantic Analysis (1 paper)
【1】 TSI: an Ad Text Strength Indicator using Text-to-CTR and Semantic-Ad-Similarity Link: https://arxiv.org/abs/2108.08226
Authors: Shaunak Mishra, Changwei Hu, Manisha Verma, Kevin Yen, Yifan Hu, Maxim Sviridenko. Affiliation: Yahoo Research, USA. Note: Accepted for publication at CIKM 2021. Abstract: Coming up with effective ad text is a time-consuming process, and particularly challenging for small businesses with limited advertising experience. When an inexperienced advertiser onboards with a poorly written ad text, the ad platform has the opportunity to detect the low-performing ad text and provide improvement suggestions. To realize this opportunity, we propose an ad text strength indicator (TSI) which: (i) predicts the click-through-rate (CTR) for an input ad text, (ii) fetches similar existing ads to create a neighborhood around the input ad, and (iii) compares the predicted CTRs in the neighborhood to declare whether the input ad is strong or weak. In addition, as suggestions for ad text improvement, TSI shows anonymized versions of superior ads (with higher predicted CTR) in the neighborhood. For (i), we propose a BERT-based text-to-CTR model trained on impressions and clicks associated with an ad text. For (ii), we propose a sentence-BERT-based semantic-ad-similarity model trained using weak labels from ad campaign setup data. Offline experiments demonstrate that our BERT-based text-to-CTR model achieves a significant lift in CTR prediction AUC for cold-start (new) advertisers compared to bag-of-words baselines. In addition, our semantic-textual-similarity model for similar-ads retrieval achieves a precision@1 of 0.93 (for retrieving ads from the same product category); this is significantly higher than unsupervised TF-IDF, word2vec, and sentence-BERT baselines. Finally, we share promising online results from advertisers on the Yahoo (Verizon Media) ad platform, where a variant of TSI was implemented with sub-second end-to-end latency.
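The three-step pipeline lends itself to a compact sketch. Everything below is illustrative: predict_ctr is a dummy stand-in for the paper's BERT text-to-CTR model, and a public sentence-BERT checkpoint substitutes for the semantic-ad-similarity model trained on campaign data.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the fine-tuned sentence-BERT

def predict_ctr(ad_text: str) -> float:
    """Placeholder for the BERT-based text-to-CTR model (step i)."""
    return 0.01 + 0.001 * len(ad_text.split())  # dummy heuristic, not a trained model

def ad_text_strength(input_ad, ad_corpus, k=5):
    vecs = encoder.encode([input_ad] + ad_corpus)  # step (ii): embed input and existing ads
    sims = vecs[1:] @ vecs[0] / (
        np.linalg.norm(vecs[1:], axis=1) * np.linalg.norm(vecs[0]))
    neighbors = [ad_corpus[i] for i in np.argsort(-sims)[:k]]  # neighborhood around the input
    input_ctr = predict_ctr(input_ad)
    neighbor_ctrs = np.array([predict_ctr(ad) for ad in neighbors])
    verdict = "strong" if input_ctr >= np.median(neighbor_ctrs) else "weak"  # step (iii)
    suggestions = [ad for ad, c in zip(neighbors, neighbor_ctrs) if c > input_ctr]
    return verdict, suggestions
```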
Graph|Knowledge Graph|Knowledge (1 paper)
【1】 GGP: A Graph-based Grouping Planner for Explicit Control of Long Text Generation Link: https://arxiv.org/abs/2108.07998
Authors: Xuming Lin, Shaobo Cui, Zhongzhou Zhao, Wei Zhou, Ji Zhang, Haiqing Chen. Affiliation: Alibaba Group. Abstract: Existing data-driven methods can handle short text generation well. However, when applied to long-text generation scenarios such as story generation or advertising text generation in commercial settings, these methods may generate illogical and uncontrollable texts. To address these issues, we propose a graph-based grouping planner (GGP) following the idea of first-plan-then-generate. Specifically, given a collection of key phrases, GGP first encodes these phrases into an instance-level sequential representation and a corpus-level graph-based representation separately. With these two synergic representations, we then regroup these phrases into a fine-grained plan, based on which we generate the final long text. We conduct experiments on three long text generation datasets, and the results reveal that GGP significantly outperforms baselines, which proves that GGP can control long text generation by deciding what to say and in what order.
Summarization|Information Extraction (1 paper)
【1】 CUSTOM: Aspect-Oriented Product Summarization for E-Commerce Link: https://arxiv.org/abs/2108.08010
Authors: Jiahui Liang, Junwei Bao, Yifan Wang, Youzheng Wu, Xiaodong He, Bowen Zhou. Affiliation: JD AI Research, Beijing, China. Note: 12 pages, 4 figures and 6 tables. Abstract: Product summarization aims to automatically generate product descriptions, which is of great commercial potential. Considering customer preferences on different product aspects, it would benefit from generating aspect-oriented customized summaries. However, conventional systems typically focus on providing general product summaries, which may miss the opportunity to match products with customer interests. To address the problem, we propose CUSTOM, aspect-oriented product summarization for e-commerce, which generates diverse and controllable summaries towards different product aspects. To support the study of CUSTOM and further this line of research, we construct two Chinese datasets, i.e., SMARTPHONE and COMPUTER, including 76,279 / 49,280 short summaries for 12,118 / 11,497 real-world commercial products, respectively. Furthermore, we introduce EXT, an extraction-enhanced generation framework for CUSTOM, in which two well-known sequence-to-sequence models are implemented in this paper. We conduct extensive experiments on the two proposed datasets and show results of two well-known baseline models and EXT, which indicate that EXT can generate diverse, high-quality, and consistent summaries.
Reasoning|Analysis|Understanding|Explanation (1 paper)
【1】 EviDR: Evidence-Emphasized Discrete Reasoning for Reasoning Machine Reading Comprehension Link: https://arxiv.org/abs/2108.07994
Authors: Yongwei Zhou, Junwei Bao, Haipeng Sun, Jiahui Liang, Youzheng Wu, Xiaodong He, Bowen Zhou, Tiejun Zhao. Affiliations: Harbin Institute of Technology, Harbin, China; JD AI Research, Beijing, China. Note: 12 pages, 1 figure and 5 tables. Abstract: Reasoning machine reading comprehension (R-MRC) aims to answer complex questions that require discrete reasoning based on text. To support discrete reasoning, evidence (typically concise textual fragments that describe question-related facts, including topic entities and attribute values) provides crucial clues from question to answer. However, previous end-to-end methods that achieve state-of-the-art performance rarely solve the problem by placing enough emphasis on the modeling of evidence, missing the opportunity to further improve the model's reasoning ability for R-MRC. To alleviate this issue, in this paper we propose an evidence-emphasized discrete reasoning approach (EviDR), in which sentence- and clause-level evidence is first detected based on distant supervision, and then used to drive a reasoning module implemented with a relational heterogeneous graph convolutional network to derive answers. Extensive experiments are conducted on the DROP (discrete reasoning over paragraphs) dataset, and the results demonstrate the effectiveness of our proposed approach. In addition, qualitative analysis verifies the capability of the proposed evidence-emphasized discrete reasoning for R-MRC.
GAN|Adversarial|Attack|Generation (2 papers)
【1】 Table Caption Generation in Scholarly Documents Leveraging Pre-trained Language Models Link: https://arxiv.org/abs/2108.08111
Authors: Junjie H. Xu, Kohei Shinden, Makoto P. Kato. Affiliation: University of Tsukuba, Ibaraki, Japan. Abstract: This paper addresses the problem of generating table captions for scholarly documents, which often requires additional information outside the table. To this end, we propose a method of retrieving relevant sentences from the paper body and feeding the table content as well as the retrieved sentences into pre-trained language models (e.g. T5 and GPT-2) for generating table captions. The contributions of this paper are: (1) discussion of the challenges in table captioning for scholarly documents; (2) development of a dataset, DocBank-TB, which is publicly available; and (3) comparison of caption generation methods for scholarly documents with different strategies to retrieve relevant sentences from the paper body. Our experimental results show that T5 is the better generation model for this task, as it outperformed GPT-2 in BLEU and METEOR, implying that the generated text is clearer and more precise. Moreover, inputting relevant sentences matching the row header or the whole table is effective.
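A minimal sketch of the generation step, assuming a Hugging Face T5 checkpoint: the table content and the retrieved body sentences are linearized into one input string. The serialization format, prompt prefix, and example data below are assumptions, not the paper's exact setup.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Hypothetical linearized table plus one retrieved body sentence.
table = "Model | BLEU | METEOR ; T5 | 12.1 | 9.8 ; GPT-2 | 10.3 | 8.5"
retrieved = "We compare T5 and GPT-2 as caption generators on DocBank-TB."

inputs = tokenizer("caption table: " + table + " context: " + retrieved,
                   return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```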
【2】 Affective Decoding for Empathetic Response Generation Link: https://arxiv.org/abs/2108.08102
Authors: Chengkun Zheng, Guanyi Chen, Chenghua Lin, Ruizhe Li, Zhigang Chen. Affiliations: Department of Computer Science, University of Sheffield; Department of Information and Computing Sciences, Utrecht University; College of Information and Intelligent Engineering, Zhejiang Wanli University. Note: Long paper accepted to INLG 2021. Abstract: Understanding a speaker's feelings and producing appropriate responses with an emotional connection is a key communicative skill for empathetic dialogue systems. In this paper, we propose a simple technique called Affective Decoding for empathetic response generation. Our method can effectively incorporate emotion signals during each decoding step, and can additionally be augmented with an auxiliary dual emotion encoder, which learns separate embeddings for the speaker and listener given the emotion base of the dialogue. Extensive empirical studies show that our models are perceived to be more empathetic by human evaluations, in comparison to several strong mainstream methods for empathetic responding.
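One plausible reading of "incorporate emotion signals during each decoding step" is to add an emotion embedding to the decoder state before the vocabulary projection. The PyTorch sketch below illustrates that idea under this assumption; module names and sizes are invented.

```python
import torch
import torch.nn as nn

class AffectiveDecodingHead(nn.Module):
    """Adds an emotion embedding to the decoder hidden state at every step
    before projecting to the vocabulary (an illustrative reading only)."""
    def __init__(self, hidden_size, vocab_size, num_emotions):
        super().__init__()
        self.emotion_embedding = nn.Embedding(num_emotions, hidden_size)
        self.out_proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, decoder_hidden, emotion_id):
        # decoder_hidden: (batch, seq_len, hidden); emotion_id: (batch,)
        h = decoder_hidden + self.emotion_embedding(emotion_id).unsqueeze(1)
        return self.out_proj(h)  # logits biased toward the target emotion

head = AffectiveDecodingHead(hidden_size=768, vocab_size=30522, num_emotions=64)
logits = head(torch.randn(2, 10, 768), torch.tensor([3, 17]))  # (2, 10, 30522)
```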
Detection (2 papers)
【1】 Fake News and Phishing Detection Using a Machine Learning Trained Expert System Link: https://arxiv.org/abs/2108.08264
Authors: Benjamin Fitzpatrick, Xinyu "Sherwin" Liang, Jeremy Straub. Affiliations: Department of Electrical and Computer Engineering, University of Alabama, Tuscaloosa, AL; Dallas College – North Lake, Irving, TX. Abstract: Expert systems have been used to enable computers to make recommendations and decisions. This paper presents the use of a machine learning trained expert system (MLES) for phishing site detection and fake news detection. Both topics share a similar goal: to design a rule-fact network that allows a computer to make explainable decisions like domain experts in each respective area. The phishing website detection study uses an MLES to detect potential phishing websites by analyzing site properties (like URL length and expiration time). The fake news detection study uses an MLES rule-fact network to gauge news story truthfulness based on factors such as emotion, the speaker's political affiliation status, and job. The two studies use different MLES network implementations, which are presented and compared herein. The fake news study utilized a more linear design while the phishing project utilized a more complex connection structure. Both networks' inputs are based on commonly available datasets.
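A toy rule-fact network in the spirit of the MLES literature, where each rule carries a weight pair (w1, 1 - w1) that machine learning can tune. Fact names, weights, and the combination rule below are invented for illustration.

```python
class Rule:
    """Combines two facts with a weight pair (w1, 1 - w1) and writes the
    result to an output fact; w1 is what training would adjust."""
    def __init__(self, fact_a, fact_b, out_fact, w1=0.5):
        self.a, self.b, self.out, self.w1 = fact_a, fact_b, out_fact, w1

    def apply(self, facts):
        facts[self.out] = self.w1 * facts[self.a] + (1 - self.w1) * facts[self.b]

# Fact values in [0, 1]; this tiny network mirrors the phishing use case.
facts = {"suspicious_url_length": 0.8, "expires_soon": 0.6, "phishing": 0.0}
rules = [Rule("suspicious_url_length", "expires_soon", "phishing", w1=0.7)]
for rule in rules:
    rule.apply(facts)
print(facts["phishing"])  # 0.74; the decision traces back to named facts
```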
【2】 Joint Multiple Intent Detection and Slot Filling via Self-distillation Link: https://arxiv.org/abs/2108.08042
Authors: Lisong Chen, Peilin Zhou, Yuexian Zou. Affiliations: ADSPLAB, School of ECE, Peking University, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China. Abstract: Intent detection and slot filling are two main tasks in natural language understanding (NLU) for identifying users' needs from their utterances. These two tasks are highly related and often trained jointly. However, most previous work assumes that each utterance corresponds to only one intent, ignoring the fact that a user utterance in many cases could include multiple intents. In this paper, we propose a novel Self-Distillation Joint NLU model (SDJN) for multi-intent NLU. First, we formulate multiple intent detection as a weakly supervised problem and approach it with multiple instance learning (MIL). Then, we design an auxiliary loop via self-distillation with three orderly arranged decoders: Initial Slot Decoder, MIL Intent Decoder, and Final Slot Decoder. The output of each decoder serves as auxiliary information for the next decoder. With the auxiliary knowledge provided by the MIL Intent Decoder, we set the Final Slot Decoder as the teacher model that imparts knowledge back to the Initial Slot Decoder to complete the loop. The auxiliary loop enables intents and slots to guide each other in depth and further boosts the overall NLU performance. Experimental results on two public multi-intent datasets indicate that our model achieves strong performance compared to others.
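The MIL view of multi-intent detection can be sketched in a few lines: per-token intent logits are aggregated over the utterance, so an utterance is positive for an intent if some token strongly supports it. The max-pooling aggregator below is an assumption, not necessarily the paper's choice.

```python
import torch

# token_logits: per-token intent scores from a shared encoder (batch, tokens, intents)
token_logits = torch.randn(1, 12, 5)
utterance_logits = token_logits.max(dim=1).values      # MIL: a bag is positive
                                                       # if any instance is
multi_intents = torch.sigmoid(utterance_logits) > 0.5  # independent sigmoid per intent
```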
Recognition/Classification (1 paper)
【1】 De-identification of Unstructured Clinical Texts from Sequence to Sequence Perspective Link: https://arxiv.org/abs/2108.07971
Authors: Md Monowar Anjum, Noman Mohammed, Xiaoqian Jiang. Affiliations: Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada; School of Biomedical Informatics, University of Texas, Houston, TX, USA. Note: Currently under consideration for ACM CCS 2021. Abstract: In this work, we propose a novel problem formulation for the de-identification of unstructured clinical text. We formulate the de-identification problem as a sequence-to-sequence learning problem instead of a token classification problem. Our approach is inspired by the recent state-of-the-art performance of sequence-to-sequence learning models for named entity recognition. Early experimentation with our proposed approach achieved a 98.91% recall rate on the i2b2 dataset. This performance is comparable to current state-of-the-art models for unstructured clinical text de-identification.
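A sketch of what the sequence-to-sequence formulation implies for training data: instead of per-token BIO labels, the target sequence is the note itself with protected health information spans replaced by category tags. The example and tag inventory are invented for illustration.

```python
# One hypothetical (source, target) training pair under the seq2seq framing.
source = "Mr. John Smith was admitted to St. Mary Hospital on 03/14/2019."
target = "Mr. [NAME] was admitted to [HOSPITAL] on [DATE]."
# Any encoder-decoder model (T5, BART, ...) can be fine-tuned on such pairs;
# recall is then measured over the masked spans rather than token labels.
print(source, "->", target)
```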
Zero/Few/One-Shot|Transfer|Adaptation (1 paper)
【1】 AdapterHub Playground: Simple and Flexible Few-Shot Learning with Adapters Link: https://arxiv.org/abs/2108.08103
Authors: Tilman Beck, Bela Bohlender, Christina Viehmann, Vincent Hane, Yanik Adamson, Jaber Khuri, Jonas Brossmann, Jonas Pfeiffer, Iryna Gurevych. Affiliations: Ubiquitous Knowledge Processing (UKP) Lab, Technical University of Darmstadt, Germany; Institut für Publizistik, Johannes Gutenberg-University Mainz, Germany. Abstract: The open-access dissemination of pretrained language models through online repositories has led to a democratization of state-of-the-art natural language processing (NLP) research. This also allows people outside of NLP to use such models and adapt them to specific use-cases. However, a certain amount of technical proficiency is still required, which is an entry barrier for users who want to apply these models to a certain task but lack the necessary knowledge or resources. In this work, we aim to overcome this gap by providing a tool which allows researchers to leverage pretrained models without writing a single line of code. Built upon the parameter-efficient adapter modules for transfer learning, our AdapterHub Playground provides an intuitive interface, allowing the usage of adapters for prediction, training and analysis of textual data for a variety of NLP tasks. We present the tool's architecture and demonstrate its advantages with prototypical use-cases, where we show that predictive performance can easily be increased in a few-shot learning scenario. Finally, we evaluate its usability in a user study. We provide the code and a live interface at https://adapter-hub.github.io/playground.
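For readers who do want to write code, a minimal adapter fine-tuning sketch with the adapter-transformers library that AdapterHub builds on might look as follows (API names follow its v2 documentation; treat them as assumptions if your version differs).

```python
from transformers import AutoTokenizer, AutoModelWithHeads  # adapter-transformers fork

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelWithHeads.from_pretrained("bert-base-uncased")

model.add_adapter("sentiment")                            # small bottleneck modules per layer
model.add_classification_head("sentiment", num_labels=2)  # task head on top
model.train_adapter("sentiment")                          # freeze backbone; train adapter only
```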
Retrieval (1 paper)
【1】 Learning Implicit User Profiles for Personalized Retrieval-Based Chatbot Link: https://arxiv.org/abs/2108.07935
Authors: Hongjin Qian, Zhicheng Dou, Yutao Zhu, Yueyuan Ma, Ji-Rong Wen. Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China; Université de Montréal, Québec, Canada. Note: Accepted by CIKM 2021; code and dataset will be released at this https URL. Abstract: In this paper, we explore the problem of developing personalized chatbots. A personalized chatbot is designed as a digital chatting assistant for a user. The key characteristic of a personalized chatbot is that it should have a consistent personality with the corresponding user. It can talk the same way as the user when it is delegated to respond to others' messages. We present a retrieval-based personalized chatbot model, namely IMPChat, that learns an implicit user profile from the user's dialogue history. We argue that the implicit user profile is superior to the explicit user profile in terms of accessibility and flexibility. IMPChat aims to learn an implicit user profile by modeling the user's personalized language style and personalized preferences separately. To learn a user's personalized language style, we elaborately build language models from shallow to deep using the user's historical responses; to model a user's personalized preferences, we explore the conditional relations underneath each post-response pair of the user. The personalized preferences are dynamic and context-aware: we assign higher weights to those historical pairs that are topically related to the current query when aggregating the personalized preferences. We match each response candidate with the personalized language style and personalized preferences, respectively, and fuse the two matching signals to determine the final ranking score. Comprehensive experiments on two large datasets show that our method outperforms all baseline models.
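The context-aware aggregation and score fusion can be sketched as follows; style_score and pref_score stand in for the paper's learned matching networks, and the softmax weighting and linear fusion are assumptions made for illustration.

```python
import numpy as np

def rank_candidates(cands, query_vec, history_vecs, style_score, pref_score, alpha=0.5):
    """Fuse the two matching signals described in the abstract."""
    relevance = history_vecs @ query_vec                   # topical relevance to the query
    weights = np.exp(relevance) / np.exp(relevance).sum()  # context-aware weights over history
    scores = []
    for c in cands:
        s_style = style_score(c)                           # personalized language style match
        s_pref = sum(w * pref_score(c, h) for w, h in zip(weights, history_vecs))
        scores.append(alpha * s_style + (1 - alpha) * s_pref)
    return sorted(zip(cands, scores), key=lambda x: -x[1])
```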
Word2Vec|Text|Words (3 papers)
【1】 RTE: A Tool for Annotating Relation Triplets from Text Link: https://arxiv.org/abs/2108.08184
Authors: Ankan Mullick, Animesh Bera, Tapas Nayak. Affiliations: CSE, IIT Kharagpur, India; Siemens Healthineers. Abstract: In this work, we present a Web-based annotation tool, Relation Triplets Extractor (RTE) (https://abera87.github.io/annotate/), for annotating relation triplets in text. Relation extraction is an important task for extracting structured information about real-world entities from the unstructured text available on the Web. In relation extraction, we focus on binary relations, i.e., relations between two entities. Recently, many supervised models have been proposed to solve this task, but they mostly use noisy training data obtained using the distant supervision method. In many cases, evaluation of the models is also done based on a noisy test dataset. The lack of annotated clean datasets is a key challenge in this area of research. In this work, we built a web-based tool with which researchers can easily annotate datasets for relation extraction on their own. We use a serverless architecture for this tool, and the entire annotation operation is processed using client-side code. Thus it does not suffer from any network latency, and the privacy of the user's data is also maintained. We hope that this tool will be beneficial for researchers advancing the field of relation extraction.
【2】 Contextualizing Variation in Text Style Transfer Datasets Link: https://arxiv.org/abs/2108.07871
Authors: Stephanie Schoch, Wanyu Du, Yangfeng Ji. Affiliation: Department of Computer Science, University of Virginia, Charlottesville, VA. Note: Accepted to INLG 2021. Abstract: Text style transfer involves rewriting the content of a source sentence in a target style. Despite there being a number of style tasks with available data, there has been limited systematic discussion of how text style datasets relate to each other. This understanding, however, is likely to have implications for selecting multiple data sources for model training. While it is prudent to consider inherent stylistic properties when determining these relationships, we must also consider how a style is realized in a particular dataset. In this paper, we conduct several empirical analyses of existing text style datasets. Based on our results, we propose a categorization of stylistic and dataset properties to consider when utilizing or comparing text style datasets.
【3】 Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021): Workshop and Shared Task Report Link: https://arxiv.org/abs/2108.07865
Authors: Ali Hürriyetoğlu, Hristo Tanev, Vanni Zavarella, Jakub Piskorski, Reyyan Yeniterzi, Erdem Yörük. Affiliations: European Commission, Ispra, Varese, Italy; Polish Academy of Sciences, Warsaw, Poland; Sabancı University, İstanbul, Turkey; Koç University, İstanbul, Turkey. Abstract: This workshop is the fourth issue of a series of workshops on automatic extraction of socio-political events from news, organized by the Emerging Market Welfare Project, with the support of the Joint Research Centre of the European Commission and with contributions from many other prominent scholars in this field. The purpose of this series of workshops is to foster research and development of reliable, valid, robust, and practical solutions for automatically detecting descriptions of socio-political events, such as protests, riots, wars and armed conflicts, in text streams. This year's workshop contributors make use of state-of-the-art NLP technologies, such as deep learning, word embeddings and Transformers, and cover a wide range of topics from text classification to news bias detection. Around 40 teams registered and 15 teams contributed to three tasks: i) multilingual protest news detection, ii) fine-grained classification of socio-political events, and iii) discovering Black Lives Matter protest events. The workshop also featured two keynotes and four invited talks about various aspects of creating event datasets and multi- and cross-lingual machine learning in few- and zero-shot settings.
Other Neural Networks|Deep Learning|Models|Modeling (2 papers)
【1】 Modulating Language Models with Emotions Link: https://arxiv.org/abs/2108.07886
Authors: Ruibo Liu, Jason Wei, Chenyan Jia, Soroush Vosoughi. Affiliations: Dartmouth College; ProtagoLabs; University of Texas at Austin. Note: Findings of ACL 2021. Abstract: Generating context-aware language that embodies diverse emotions is an important step towards building empathetic NLP systems. In this paper, we propose a formulation of modulated layer normalization (a technique inspired by computer vision) that allows us to use large-scale language models for emotional response generation. In automatic and human evaluation on the MojiTalk dataset, our proposed modulated layer normalization method outperforms prior baseline methods while maintaining diversity, fluency, and coherence. Our method also obtains competitive performance even when using only 10% of the available training data.
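A sketch of the modulated layer normalization idea: the gain and bias of layer normalization are predicted from an emotion embedding, h' = gamma(e) * (h - mu) / sigma + beta(e). The parameterization below is an assumption in the spirit of conditional normalization (cf. FiLM/AdaIN), not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ModulatedLayerNorm(nn.Module):
    """LayerNorm whose gain and bias are functions of an emotion embedding."""
    def __init__(self, hidden_size, emotion_dim):
        super().__init__()
        self.to_gamma = nn.Linear(emotion_dim, hidden_size)
        self.to_beta = nn.Linear(emotion_dim, hidden_size)

    def forward(self, h, emotion):
        # h: (batch, seq_len, hidden); emotion: (batch, emotion_dim)
        mu = h.mean(dim=-1, keepdim=True)
        sigma = h.std(dim=-1, keepdim=True)
        gamma = self.to_gamma(emotion).unsqueeze(1)
        beta = self.to_beta(emotion).unsqueeze(1)
        return gamma * (h - mu) / (sigma + 1e-5) + beta

mln = ModulatedLayerNorm(hidden_size=768, emotion_dim=64)
out = mln(torch.randn(2, 10, 768), torch.randn(2, 64))  # (2, 10, 768)
```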
【2】 A comparative study of universal quantum computing models: towards a physical unification Link: https://arxiv.org/abs/2108.07909
Authors: D.-S. Wang. Affiliation: CAS Key Laboratory of Theoretical Physics, Institute of Theoretical Physics. Abstract: Quantum computing has been a fascinating research field in quantum physics. Recent progress motivates us to study in depth the universal quantum computing models (UQCM), which lie at the foundation of quantum computing and have tight connections with fundamental physics. Although developed decades ago, a physically concise principle or picture to formalize and understand UQCM is still lacking. This is challenging given the diversity of still-emerging models, but important for understanding the difference between classical and quantum computing. In this work, we make a preliminary attempt to unify UQCM by classifying a few of them into two categories, hence producing a table of models. With such a table, some known models or schemes appear as hybridizations or combinations of others, and more importantly, it leads to new schemes that have not been explored yet. Our study of UQCM also yields some insights into quantum algorithms. This work reveals the importance and feasibility of a systematic study of computing models.
Others (4 papers)
【1】 Deep Natural Language Processing for LinkedIn Search Systems Link: https://arxiv.org/abs/2108.08252
Authors: Weiwei Guo, Xiaowei Liu, Sida Wang, Michaeel Kazi, Zhoutong Fu, Huiji Gao, Jun Jia, Liang Zhang, Bo Long. Affiliation: LinkedIn, Mountain View, CA. Abstract: Many search systems work with large amounts of natural language data, e.g., search queries, user profiles and documents, where deep learning based natural language processing techniques (deep NLP) can be of great help. In this paper, we introduce a comprehensive study of applying deep NLP techniques to five representative tasks in search engines. Through the model design and experiments of the five tasks, readers can find answers to three important questions: (1) When is deep NLP helpful/not helpful in search systems? (2) How to address latency challenges? (3) How to ensure model robustness? This work builds on existing efforts of LinkedIn search, and is tested at scale on a commercial search engine. We believe our experiences can provide useful insights for the industry and research communities.
【2】 X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics Link: https://arxiv.org/abs/2108.08217
Authors: Yehao Li, Yingwei Pan, Jingwen Chen, Ting Yao, Tao Mei. Affiliation: JD AI Research, Beijing, China. Note: Accepted by the 2021 ACM MM Open Source Software Competition; source code: this https URL. Abstract: With the rise and development of deep learning over the past decade, there has been a steady momentum of innovation and breakthroughs that convincingly push the state-of-the-art of cross-modal analytics between vision and language in the multimedia field. Nevertheless, there has not been an open-source codebase in support of training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion. In this work, we propose X-modaler, a versatile and high-performance codebase that encapsulates state-of-the-art cross-modal analytics into several general-purpose stages (e.g., pre-processing, encoder, cross-modal interaction, decoder, and decode strategy). Each stage is empowered with functionality that covers a series of modules widely adopted in the state of the art and allows seamless switching between them. This naturally enables a flexible implementation of state-of-the-art algorithms for image captioning, video captioning, and vision-language pre-training, aiming to facilitate the rapid development of the research community. Meanwhile, since the effective modular designs in several stages (e.g., cross-modal interaction) are shared across different vision-language tasks, X-modaler can be simply extended to power startup prototypes for other tasks in cross-modal analytics, including visual question answering, visual commonsense reasoning, and cross-modal retrieval. X-modaler is an Apache-licensed codebase, and its source code, sample projects and pre-trained models are available online: https://github.com/YehLi/xmodaler.
【3】 SHAQ: Single Headed Attention with Quasi-Recurrence Link: https://arxiv.org/abs/2108.08207
Authors: Nashwin Bharwani, Warren Kushner, Sangeet Dandona, Ben Schreiber. Note: 8 pages, 11 figures. Abstract: Natural language processing research has recently been dominated by large-scale Transformer models. Although they achieve state of the art on many important language tasks, Transformers often require expensive compute resources and days to weeks to train. This is feasible for researchers at big tech companies and leading research universities, but not for scrappy start-up founders, students, and independent researchers. Stephen Merity's SHA-RNN, a compact hybrid attention-RNN model, is designed for consumer-grade modeling, as it requires significantly fewer parameters and less training time to reach near state-of-the-art results. We analyze Merity's model here through an exploratory model analysis over several units of the architecture, considering both training time and overall quality in our assessment. Ultimately, we combine these findings into a new architecture which we call SHAQ: Single Headed Attention Quasi-recurrent Neural Network. With our new architecture we achieved similar accuracy results to the SHA-RNN while accomplishing a 4x speed boost in training.
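The core idea, a single attention head over the whole model rather than many heads per layer, can be sketched as a small PyTorch block. This illustrates the single-headed attention unit only, not the released SHA-RNN/SHAQ architecture.

```python
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    """One scaled dot-product attention head, as in SHA-RNN/SHAQ (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (batch, seq_len, dim)
        attn = torch.softmax(self.q(x) @ self.k(x).transpose(-2, -1) * self.scale, dim=-1)
        return attn @ self.v(x)

block = SingleHeadAttention(dim=512)
y = block(torch.randn(2, 32, 512))  # (2, 32, 512)
```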
【4】 Higher-Order Concurrency for Microcontrollers Link: https://arxiv.org/abs/2108.07805
Authors: Abhiroop Sarkar, Robert Krook, Bo Joel Svensson, Mary Sheeran. Affiliation: Chalmers University, Gothenburg, Sweden. Abstract: Programming microcontrollers involves low-level interfacing with hardware and peripherals that are concurrent and reactive. Such programs are typically written in a mixture of C and assembly using concurrent language extensions (like FreeRTOS tasks and semaphores), resulting in unsafe, callback-driven, error-prone and difficult-to-maintain code. We address this challenge by introducing SenseVM, a bytecode-interpreted virtual machine that provides a message-passing based higher-order concurrency model, originally introduced by Reppy, for microcontroller programming. This model treats synchronous operations as first-class values (called Events), akin to the treatment of first-class functions in functional languages. This primarily allows the programmer to compose and tailor their own concurrency abstractions and, additionally, abstracts away unsafe memory operations common in shared-memory concurrency models, thereby making microcontroller programs safer, composable and easier to maintain. Our VM is made portable via a low-level bridge interface built atop the embedded OS Zephyr. The bridge is implemented by all drivers and designed such that programming in response to a software message or a hardware interrupt remains uniform and indistinguishable. In this paper we demonstrate the features of our VM through an example, written in a Caml-like functional language, running on the nRF52840 and STM32F4 microcontrollers.
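The "events as first-class values" idea can be illustrated in a few lines of Python: a synchronous operation becomes a value that can be combined (wrapped) before being synchronized on. SenseVM exposes this model to a Caml-like language, not Python, so the names below are illustrative only.

```python
import queue

class Event:
    """A synchronous operation as a first-class value (Reppy's CML, sketched)."""
    def __init__(self, sync_fn):
        self._sync = sync_fn

    def sync(self):
        return self._sync()                     # block until the operation completes

    def wrap(self, fn):
        return Event(lambda: fn(self.sync()))   # post-process the result, still an Event

def recv_evt(channel):
    return Event(channel.get)                   # receiving becomes a value, not an action

ch = queue.Queue()
ch.put(41)
print(recv_evt(ch).wrap(lambda x: x + 1).sync())  # -> 42
```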