cs.CL, 20 papers today
BERT (1 paper)
【1】 Adapting GPT, GPT-2 and BERT Language Models for Speech Recognition
Link: https://arxiv.org/abs/2108.07789
Authors: Xianrui Zheng, Chao Zhang, Philip C. Woodland
Affiliations: Cambridge University Engineering Dept., Trumpington St., Cambridge, CB2 1PZ, U.K.
Abstract: Language models (LMs) pre-trained on massive amounts of text, in particular bidirectional encoder representations from Transformers (BERT), generative pre-training (GPT), and GPT-2, have become a key technology for many natural language processing tasks. In this paper, we present results using fine-tuned GPT, GPT-2, and their combination for automatic speech recognition (ASR). Unlike the unidirectional LMs GPT and GPT-2, BERT is bidirectional, so the direct product of its output probabilities is no longer a valid language prior probability. A conversion method is proposed to compute the correct language prior probability based on bidirectional LM outputs in a mathematically exact way. Experimental results on the widely used AMI and Switchboard ASR tasks showed that the combination of the fine-tuned GPT and GPT-2 outperformed the combination of three neural LMs with different architectures trained from scratch on the in-domain text by up to a 12% relative word error rate reduction (WERR). Furthermore, the proposed conversion for language prior probabilities enables BERT to receive an extra 3% relative WERR, and the combination of BERT, GPT and GPT-2 results in further improvements.
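The paper fine-tunes GPT-style LMs and uses their sentence-level log-probabilities to rescore ASR hypotheses. As a minimal sketch of that rescoring step (not the paper's exact pipeline; the model name, interpolation weight, and acoustic-score field are assumptions), one can score each n-best hypothesis with a pretrained GPT-2 and interpolate with the acoustic score:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_logprob(text: str) -> float:
    """Sentence log-probability under GPT-2 (sum over predicted tokens)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # with labels=ids, the loss is the mean cross-entropy over the
        # seq_len-1 predicted tokens, so multiply back to get a sum
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

def rescore(nbest, lm_weight=0.5):
    """Pick the hypothesis maximising acoustic score + lm_weight * LM score."""
    return max(nbest, key=lambda h: h["am_score"] + lm_weight * lm_logprob(h["text"]))

# hypothetical 2-best list from an ASR decoder
nbest = [{"text": "recognise speech", "am_score": -12.3},
         {"text": "wreck a nice beach", "am_score": -11.9}]
print(rescore(nbest)["text"])
```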
QA|VQA|Question Answering|Dialogue (1 paper)
【1】 Generative Relation Linking for Question Answering over Knowledge Bases
Link: https://arxiv.org/abs/2108.07337
Authors: Gaetano Rossiello, Nandana Mihindukulasooriya, Ibrahim Abdelaziz, Mihaela Bornea, Alfio Gliozzo, Tahira Naseem, Pavan Kapanipathi
Affiliations: IBM Research, T.J. Watson Research Center, Yorktown Heights, NY, USA
Note: Accepted at the 20th International Semantic Web Conference (ISWC 2021)
Abstract: Relation linking is essential to enable question answering over knowledge bases. Although there are various efforts to improve relation linking performance, the current state-of-the-art methods do not achieve optimal results, therefore negatively impacting the overall end-to-end question answering performance. In this work, we propose a novel approach for relation linking, framing it as a generative problem to facilitate the use of pre-trained sequence-to-sequence models. We extend such sequence-to-sequence models with the idea of infusing structured data from the target knowledge base, primarily to enable these models to handle the nuances of the knowledge base. Moreover, we train the model with the aim of generating a structured output consisting of a list of argument-relation pairs, enabling a knowledge validation step. We compared our method against existing relation linking systems on four different datasets derived from DBpedia and Wikidata. Our method reports large improvements over the state of the art while using a much simpler model that can be easily adapted to different knowledge bases.
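The approach casts relation linking as text generation with a pre-trained seq2seq model whose output is a list of argument-relation pairs. A hedged sketch of that generation step, assuming a BART backbone fine-tuned to emit pairs in a simple `arg|relation;...` serialization (the checkpoint name and output format are illustrative assumptions, not the paper's):

```python
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def link_relations(question: str) -> list[tuple[str, str]]:
    """Generate argument-relation pairs for a question (after fine-tuning)."""
    ids = tokenizer(question, return_tensors="pt").input_ids
    out = model.generate(ids, num_beams=4, max_length=64)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    # e.g. "Tchaikovsky|dbo:birthPlace" -> [("Tchaikovsky", "dbo:birthPlace")]
    pairs = [p.split("|", 1) for p in text.split(";") if "|" in p]
    return [(a.strip(), r.strip()) for a, r in pairs]

print(link_relations("Where was Tchaikovsky born?"))
```

Before fine-tuning, the base model will of course not produce meaningful pairs; the sketch only illustrates the generate-then-parse interface that makes the knowledge validation step possible.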
Semantic Analysis (2 papers)
【1】 A Game Interface to Study Semantic Grounding in Text-Based Models
Link: https://arxiv.org/abs/2108.07708
Authors: Timothee Mickus, Mathieu Constant, Denis Paperno
Affiliations: ATILF, CNRS, Université de Lorraine, Nancy, France; Linguistics Department, Utrecht University, Utrecht, Netherlands
Abstract: Can language models learn grounded representations from text distribution alone? This question is both central and recurrent in natural language processing; authors generally agree that grounding requires more than textual distribution. We propose to experimentally test this claim: if any two words have different meanings and yet cannot be distinguished from distribution alone, then grounding is out of the reach of text-based models. To that end, we present early work on an online game for the collection of human judgments on the distributional similarity of word pairs in five languages. We further report early results of our data collection campaign.
【2】 Not All Linearizations Are Equally Data-Hungry in Sequence Labeling Parsing
Link: https://arxiv.org/abs/2108.07556
Authors: Alberto Muñoz-Ortiz, Michalina Strzyz, David Vilares
Affiliations: Universidade da Coruña, CITIC, Spain; Priberam Labs, Portugal
Note: Accepted at RANLP 2021 (this https URL)
Abstract: Different linearizations have been proposed to cast dependency parsing as sequence labeling and solve the task as: (i) a head selection problem, (ii) finding a representation of the token arcs as bracket strings, or (iii) associating partial transition sequences of a transition-based parser with words. Yet, there is little understanding of how these linearizations behave in low-resource setups. Here, we first study their data efficiency, simulating data-restricted setups from a diverse set of rich-resource treebanks. Second, we test whether such differences manifest in truly low-resource setups. The results show that head selection encodings are more data-efficient and perform better in an ideal (gold) framework, but that such advantage greatly vanishes in favour of bracketing formats when the running setup resembles a real-world low-resource configuration.
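Of the three linearizations, head selection assigns each token a label encoding (an offset to) its head plus the dependency relation. A minimal sketch of a relative-offset head-selection encoder (one common variant; the paper's exact label scheme may differ):

```python
def head_selection_encode(heads, deprels):
    """Encode a dependency tree as one sequence-labeling label per token.

    heads[i] is the 1-based index of the head of token i+1 (0 for the root);
    labels are 'offset@deprel', a common head-selection variant.
    """
    labels = []
    for i, (head, rel) in enumerate(zip(heads, deprels), start=1):
        offset = 0 if head == 0 else head - i  # relative position of the head
        labels.append(f"{offset}@{rel}")
    return labels

# "She reads books": heads of the three tokens are reads(2), root(0), reads(2)
print(head_selection_encode([2, 0, 2], ["nsubj", "root", "obj"]))
# ['1@nsubj', '0@root', '-1@obj']
```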
Graph|Knowledge Graph|Knowledge (2 papers)
【1】 MigrationsKB: A Knowledge Base of Public Attitudes towards Migrations and their Driving Factors
Link: https://arxiv.org/abs/2108.07593
Authors: Yiyi Chen, Harald Sack, Mehwish Alam
Affiliations: FIZ Karlsruhe – Leibniz Institute for Information Infrastructure, Germany; Karlsruhe Institute of Technology, Institute AIFB, Germany
Note: 19 pages, 11 figures
Abstract: With the increasing trend in the topic of migration in Europe, the public is now more engaged in expressing their opinions through various platforms such as Twitter. Understanding the online discourses is therefore essential to capture public opinion. The goal of this study is the analysis of social media platforms to quantify public attitudes towards migrations and to identify the different factors causing these attitudes. Tweets spanning from 2013 to July 2021 in the European countries which are hosts to immigrants are collected, pre-processed, and filtered using advanced topic modeling techniques. BERT-based entity linking and sentiment analysis, and attention-based hate speech detection are performed to annotate the curated tweets. Moreover, external databases are used to identify the potential social and economic factors causing negative attitudes of the people about migration. To further promote research in the interdisciplinary fields of social science and computer science, the outcomes are integrated into a Knowledge Base (KB), i.e., MigrationsKB, which significantly extends existing models to take into account the public attitudes towards migrations and the economic indicators. The KB is made public following FAIR principles and can be queried through a SPARQL endpoint. Data dumps are made available on Zenodo.
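Since MigrationsKB is exposed through a SPARQL endpoint, it can be queried programmatically. A hedged sketch using SPARQLWrapper (the endpoint URL and the property names are illustrative assumptions; consult the KB documentation for the real schema):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# hypothetical endpoint and vocabulary, for illustration only
sparql = SPARQLWrapper("https://example.org/migrationskb/sparql")
sparql.setQuery("""
    SELECT ?tweet ?sentiment WHERE {
        ?tweet a <https://example.org/mkb#Tweet> ;
               <https://example.org/mkb#sentiment> ?sentiment .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["tweet"]["value"], row["sentiment"]["value"])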
【2】 Graph Capsule Aggregation for Unaligned Multimodal Sequences
Link: https://arxiv.org/abs/2108.07543
Authors: Jianfeng Wu, Sijie Mai, Haifeng Hu
Affiliations: Sun Yat-sen University, Guangzhou, Guangdong, China
Abstract: Humans express their opinions and emotions through multiple modalities, mainly textual, acoustic and visual. Prior works on multimodal sentiment analysis mostly apply Recurrent Neural Networks (RNNs) to model aligned multimodal sequences. However, it is impractical to align multimodal sequences due to the different sample rates of different modalities. Moreover, RNNs are prone to the issues of gradient vanishing or exploding and have limited capacity for learning long-range dependency, which is the major obstacle to modeling unaligned multimodal sequences. In this paper, we introduce Graph Capsule Aggregation (GraphCAGE) to model unaligned multimodal sequences with a graph-based neural model and a Capsule Network. By converting sequence data into graphs, the previously mentioned problems of RNNs are avoided. In addition, the aggregation capability of the Capsule Network and the graph-based structure make our model interpretable and better at solving the problem of long-range dependency. Experimental results suggest that GraphCAGE achieves state-of-the-art performance on two benchmark datasets, with representations refined by the Capsule Network and interpretation provided.
GAN|Adversarial|Attacks|Generation (1 paper)
【1】 Reusable Templates and Guides For Documenting Datasets and Models for Natural Language Processing and Generation: A Case Study of the HuggingFace and GEM Data and Model Cards
Link: https://arxiv.org/abs/2108.07374
Authors: Angelina McMillan-Major, Salomey Osei, Juan Diego Rodriguez, Pawan Sasanka Ammanamanchi, Sebastian Gehrmann, Yacine Jernite
Affiliations: University of Washington; Hugging Face; KNUST; Masakhane; UT Austin; IIIT Hyderabad; Google Research
Note: 15 pages; in Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)
Abstract: Developing documentation guidelines and easy-to-use templates for datasets and models is a challenging task, especially given the variety of backgrounds, skills, and incentives of the people involved in the building of natural language processing (NLP) tools. Nevertheless, the adoption of standard documentation practices across the field of NLP promotes more accessible and detailed descriptions of NLP datasets and models, while supporting researchers and developers in reflecting on their work. To help with the standardization of documentation, we present two case studies of efforts that aim to develop reusable documentation templates -- the HuggingFace data card, a general purpose card for datasets in NLP, and the GEM benchmark data and model cards with a focus on natural language generation. We describe our process for developing these templates, including the identification of relevant stakeholder groups, the definition of a set of guiding principles, the use of existing templates as our foundation, and iterative revisions based on feedback.
Semi/Weakly/Unsupervised|Uncertainty (1 paper)
【1】 A Weak Supervised Dataset of Fine-Grained Emotions in Portuguese
Link: https://arxiv.org/abs/2108.07638
Authors: Diogo Cortiz, Jefferson O. Silva, Newton Calegari, Ana Luísa Freitas, Ana Angélica Soares, Carolina Botelho, Gabriel Gaudencio Rêgo, Waldir Sampaio, Paulo Sergio Boggio
Affiliations: Brazilian Network Information Center (NIC.br), São Paulo, SP, Brazil; Pontifical Catholic University of São Paulo (PUC-SP); Mackenzie Presbyterian University; Jusbrasil
Abstract: Affective Computing is the study of how computers can recognize, interpret and simulate human affects. Sentiment Analysis is a common NLP task related to this topic, but it focuses only on emotion valence (positive, negative, neutral). An emerging approach in NLP is Emotion Recognition, which relies on fine-grained classification. This research describes an approach to create a lexical-based, weakly supervised corpus for fine-grained emotion in Portuguese. We evaluate our dataset by fine-tuning a transformer-based language model (BERT) and validating it on a gold standard annotated validation set. Our results (F1-score = 0.64) suggest lexical-based weak supervision as an appropriate strategy for initial work in low-resource environments.
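The corpus is built by weak supervision from an emotion lexicon rather than manual labels. A minimal sketch of that labeling idea (the tiny lexicon and the majority-vote rule are illustrative assumptions, not the paper's resources):

```python
# toy Portuguese emotion lexicon: word -> fine-grained emotion
LEXICON = {
    "feliz": "joy", "alegria": "joy",
    "medo": "fear", "raiva": "anger", "triste": "sadness",
}

def weak_label(sentence: str):
    """Assign the emotion whose lexicon words occur most often, else None."""
    counts = {}
    for token in sentence.lower().split():
        emotion = LEXICON.get(token.strip(".,!?"))
        if emotion:
            counts[emotion] = counts.get(emotion, 0) + 1
    return max(counts, key=counts.get) if counts else None

print(weak_label("Estou muito feliz hoje!"))  # joy
```

Sentences with no lexicon hit receive no label and would be dropped; the resulting silver corpus is what the BERT model is then fine-tuned on.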
Detection (1 paper)
【1】 Independent Ethical Assessment of Text Classification Models: A Hate Speech Detection Case Study
Link: https://arxiv.org/abs/2108.07627
Authors: Amitoj Singh, Jingshu Chen, Lihao Zhang, Amin Rasekh, Ilana Golbin, Anand Rao
Affiliations: Emerging Technology PwC; PwC Acceleration Center, Mumbai, India; PwC Acceleration Center Shanghai, Pudong New District, Shanghai, China; Product Innovation PwC, Washington, DC, USA; Los Angeles, CA, USA; Global AI Leader PwC, Boston, MA, USA
Note: 27th SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2021), August 14-18, 2021, Singapore
Abstract: An independent ethical assessment of an artificial intelligence system is an impartial examination of the system's development, deployment, and use in alignment with ethical values. System-level qualitative frameworks that describe high-level requirements and component-level quantitative metrics that measure individual ethical dimensions have been developed over the past few years. However, there exists a gap between the two, which hinders the execution of independent ethical assessments in practice. This study bridges this gap and designs a holistic independent ethical assessment process for a text classification model, with a special focus on the task of hate speech detection. The assessment is further augmented with protected attribute mining and counterfactual-based analysis to enhance bias assessment. It covers assessments of technical performance, data bias, embedding bias, classification bias, and interpretability. The proposed process is demonstrated through an assessment of a deep hate speech detection model.
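One component of the assessment is counterfactual-based bias analysis: perturb identity terms in an input and check whether the classifier's score shifts. A minimal sketch of that probe (the `classify` callable stands in for any hate speech model; the term pairs are illustrative assumptions):

```python
# Counterfactual probe: swap identity terms and compare model scores.
IDENTITY_PAIRS = [("women", "men"), ("muslims", "christians")]

def counterfactual_gaps(text: str, classify):
    """Return score shifts when each identity term is swapped for its pair."""
    base = classify(text)
    gaps = []
    for a, b in IDENTITY_PAIRS:
        if a in text:
            gaps.append((f"{a}->{b}", classify(text.replace(a, b)) - base))
    return gaps

# `classify` is any callable text -> P(hate); here a dummy for illustration
dummy = lambda t: 0.9 if "women" in t else 0.2
print(counterfactual_gaps("some text about women", dummy))  # [('women->men', -0.7)]
```

Large gaps flag inputs where the prediction depends on the identity term rather than the actual content, feeding the classification-bias part of the assessment.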
Recognition/Classification (2 papers)
【1】 Dynamic Multi-scale Convolution for Dialect Identification
Link: https://arxiv.org/abs/2108.07787
Authors: Tianlong Kong, Shouyi Yin, Dawei Zhang, Wang Geng, Xin Wang, Dandan Song, Jinwen Huang, Huiyu Shi, Xiaorui Wang
Affiliations: Institute of Microelectronics, Tsinghua University; Kwai, Beijing, P.R. China
Abstract: Time Delay Neural Network (TDNN)-based methods are widely used in dialect identification. However, previous TDNN applications neglect subtle variation across different feature scales. To address this issue, we propose a new architecture, named dynamic multi-scale convolution, which consists of dynamic kernel convolution, local multi-scale learning, and global multi-scale pooling. Dynamic kernel convolution adaptively captures features between short-term and long-term context. Local multi-scale learning, which represents multi-scale features at a granular level, is able to increase the range of receptive fields for the convolution operation. Besides, global multi-scale pooling is applied to aggregate features from different bottleneck layers in order to collect information from multiple aspects. The proposed architecture significantly outperforms the state-of-the-art system on the AP20-OLR-dialect-task of the Oriental Language Recognition (OLR) Challenge 2020, with the best average cost performance (Cavg) of 0.067 and the best equal error rate (EER) of 6.52%. Compared with the best known results, our method achieves relative improvements of 9% in Cavg and 45% in EER, respectively. Furthermore, the proposed model has 91% fewer parameters than the best known model.
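Local multi-scale learning represents features at a granular level by splitting channels into groups and chaining small convolutions so that later groups see larger receptive fields, in the spirit of Res2Net. A minimal PyTorch sketch of such a block (an interpretation of the idea, not the paper's exact layer):

```python
import torch
import torch.nn as nn

class LocalMultiScaleConv(nn.Module):
    """Split channels into `scales` groups; each group's conv also sees the
    previous group's output, growing the receptive field group by group."""
    def __init__(self, channels: int, scales: int = 4, kernel_size: int = 3):
        super().__init__()
        assert channels % scales == 0
        self.scales = scales
        width = channels // scales
        self.convs = nn.ModuleList(
            nn.Conv1d(width, width, kernel_size, padding=kernel_size // 2)
            for _ in range(scales - 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        chunks = torch.chunk(x, self.scales, dim=1)
        out = [chunks[0]]          # first group passes through unchanged
        prev = chunks[0]
        for conv, chunk in zip(self.convs, chunks[1:]):
            prev = conv(chunk + prev)  # reuse the previous group's features
            out.append(prev)
        return torch.cat(out, dim=1)

block = LocalMultiScaleConv(64)
print(block(torch.randn(2, 64, 100)).shape)  # torch.Size([2, 64, 100])
```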
【2】 A Light-weight contextual spelling correction model for customizing transducer-based speech recognition systems
Link: https://arxiv.org/abs/2108.07493
Authors: Xiaoqiang Wang, Yanqing Liu, Sheng Zhao, Jinyu Li
Affiliations: Microsoft, China; Microsoft, USA
Note: This paper has been accepted by Interspeech 2021
Abstract: It is challenging to customize a transducer-based automatic speech recognition (ASR) system with context information which is dynamic and unavailable during model training. In this work, we introduce a light-weight contextual spelling correction model to correct context-related recognition errors in transducer-based ASR systems. We incorporate the context information into the spelling correction model with a shared context encoder and use a filtering algorithm to handle large-size context lists. Experiments show that the model improves baseline ASR model performance with about 50% relative word error rate reduction, which also significantly outperforms baseline methods such as contextual LM biasing. The model also shows excellent performance for out-of-vocabulary terms not seen during training.
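To keep inference tractable with large context lists, the model relies on a filtering algorithm that keeps only context phrases plausibly relevant to the current hypothesis. A hedged sketch of one such prefilter based on string similarity (the paper's actual algorithm may differ; the threshold and list size are assumptions):

```python
from difflib import SequenceMatcher

def filter_context(hypothesis: str, context_list: list,
                   threshold: float = 0.5, max_keep: int = 50) -> list:
    """Keep only context phrases that look similar to the ASR hypothesis."""
    def score(phrase: str) -> float:
        # crude relevance proxy: character-level similarity to the hypothesis
        return SequenceMatcher(None, phrase.lower(), hypothesis.lower()).ratio()
    scored = sorted(context_list, key=score, reverse=True)
    return [p for p in scored[:max_keep] if score(p) >= threshold]

contacts = ["Joao Araujo", "Jon Arbuckle", "Acme Corp"]
print(filter_context("call joao arauho", contacts))  # ['Joao Araujo']
```

The surviving phrases are what the shared context encoder actually embeds, which is what keeps the correction model light-weight at run time.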
Corpora (2 papers)
【1】 Annotation Guidelines for the Turku Paraphrase Corpus
Link: https://arxiv.org/abs/2108.07499
Authors: Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, Jemina Kilpeläinen, Hanna-Mari Kupari, Aurora Piirto, Jenna Saarni, Maija Sevón, Otto Tarkka
Note: The Turku Paraphrase Corpus is available at this https URL
Abstract: This document describes the annotation guidelines used to construct the Turku Paraphrase Corpus. These guidelines were developed together with the corpus annotation, with the guidelines revised and extended regularly during the annotation work. Our paraphrase annotation scheme uses the base scale 1-4, where labels 1 and 2 are used for negative candidates (not paraphrases), while labels 3 and 4 are paraphrases at least in the given context, if not everywhere. In addition to the base labeling, the scheme is enriched with additional subcategories (flags) for categorizing different types of paraphrases inside the two positive labels, making the annotation scheme suitable for more fine-grained paraphrase categorization. The annotation scheme has been used to annotate over 100,000 Finnish paraphrase pairs.
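Each annotation is thus a base label 1-4, optionally enriched with flag characters on the positive labels. A minimal sketch of parsing such labels into (base, flags) without committing to the flags' exact inventory (the example flag strings are illustrative assumptions, not the corpus's documented set):

```python
def parse_label(label: str):
    """Split a paraphrase label such as '4>' or '3i' into base score and flags."""
    base = int(label[0])
    flags = set(label[1:])
    if base <= 2 and flags:
        raise ValueError("flags are only defined for the positive labels 3 and 4")
    return base, flags

for raw in ["2", "3", "4>", "4i"]:
    print(raw, "->", parse_label(raw))
```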
【2】 An NLP approach to quantify dynamic salience of predefined topics in a text corpus
Link: https://arxiv.org/abs/2108.07345
Authors: A. Bock, A. Palladino, S. Smith-Heisters, I. Boardman, E. Pellegrini, E. J. Bienenstock, A. Valenti
Affiliations: Boston Fusion Corp., Lexington, MA, USA; Arizona State University, Phoenix, AZ, USA
Note: This paper was presented at the 2021 International Conference on Social Computing, Behavioral-Cultural Modeling Prediction and Behavior Representation in Modeling and Simulation (SBP-BRiMS), 9 July 2021
Abstract: The proliferation of news media available online simultaneously presents a valuable resource and a significant challenge to analysts aiming to profile and understand social and cultural trends in a geographic location of interest. While an abundance of news reports documenting significant events, trends, and responses provides a more democratized picture of the social characteristics of a location, making sense of an entire corpus to extract significant trends is a steep challenge for any one analyst or team. Here, we present an approach using natural language processing techniques that seeks to quantify how a set of pre-defined topics of interest changes over time across a large corpus of text. We found that, given a predefined topic, we can identify and rank sets of terms, or n-grams, that map to those topics and have usage patterns that deviate from a normal baseline. Emergence, disappearance, or significant variations in n-gram usage present a ground-up picture of a topic's dynamic salience within a corpus of interest.
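The core of the method is detecting when topic-mapped n-grams deviate from their baseline usage over time. A minimal sketch of one way to flag such deviations, using per-period frequencies and a z-score against a baseline window (the thresholds and windowing are illustrative assumptions, not the paper's exact procedure):

```python
from statistics import mean, stdev

def salient_periods(freqs: list, baseline_len: int = 12,
                    z_thresh: float = 2.0) -> list:
    """Indices where an n-gram's frequency deviates from its baseline window."""
    base = freqs[:baseline_len]
    mu, sigma = mean(base), stdev(base) or 1e-9  # guard against zero variance
    return [i for i, f in enumerate(freqs[baseline_len:], start=baseline_len)
            if abs(f - mu) / sigma >= z_thresh]

# monthly relative frequency of a topic n-gram, e.g. "border crossing"
series = [0.10, 0.12, 0.11, 0.09, 0.10, 0.11, 0.10, 0.12, 0.11, 0.10,
          0.09, 0.11, 0.10, 0.45, 0.50, 0.12]
print(salient_periods(series))  # [13, 14]
```

Sudden appearance (frequency rising from zero) and disappearance fall out of the same test, since both produce large deviations from the baseline mean.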
Other Neural Networks|Deep Learning|Models|Modeling (2 papers)
【1】 Mitigating harm in language models with conditional-likelihood filtration
Link: https://arxiv.org/abs/2108.07790
Authors: Helen Ngo, Cooper Raterink, João G. M. Araújo, Ivan Zhang, Carol Chen, Adrien Morisot, Nicholas Frosst
Abstract: Language models trained on large-scale unfiltered datasets curated from the open web acquire systemic biases, prejudices, and harmful views from their training data. We present a methodology for programmatically identifying and removing harmful text from web-scale datasets. A pretrained language model is used to calculate the log-likelihood of researcher-written trigger phrases conditioned on a specific document, which is used to identify and filter documents from the dataset. We demonstrate that models trained on this filtered dataset exhibit lower propensity to generate harmful text, with a marginal decrease in performance on standard language modeling benchmarks compared to unfiltered baselines. We provide a partial explanation for this performance gap by surfacing examples of hate speech and other undesirable content from standard language modeling benchmarks. Finally, we discuss the generalization of this method and how trigger phrases which reflect specific values can be used by researchers to build language models which are more closely aligned with their values.
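The filtering signal is the log-likelihood of a researcher-written trigger phrase conditioned on each document, computed with a pretrained LM. A minimal sketch of that computation with GPT-2 (the model choice, trigger phrase, and threshold are illustrative assumptions; the paper uses its own phrases and models):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def trigger_loglik(document: str, trigger: str) -> float:
    """log P(trigger | document): sum of log-probs of the trigger tokens only."""
    doc_ids = tokenizer(document, return_tensors="pt").input_ids
    trig_ids = tokenizer(" " + trigger, return_tensors="pt").input_ids
    ids = torch.cat([doc_ids, trig_ids], dim=1)
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(dim=-1)
    total = 0.0
    for pos in range(doc_ids.size(1), ids.size(1)):
        # the logits at pos-1 predict the token at pos
        total += logprobs[0, pos - 1, ids[0, pos]].item()
    return total

def keep_document(doc: str, trigger: str, threshold: float) -> bool:
    """Keep a document only if the trigger phrase is sufficiently unlikely."""
    return trigger_loglik(doc, trigger) < threshold
```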
【2】 Modeling Protein Using Large-scale Pretrain Language Model
Link: https://arxiv.org/abs/2108.07435
Authors: Yijia Xiao, Jiezhong Qiu, Ziang Li, Chang-Yu Hsieh, Jie Tang
Affiliations: Department of Computer Science and Technology, Tsinghua University; Beijing Academy of Artificial Intelligence; Tencent Quantum Lab
Note: Accepted paper in Pretrain@KDD 2021 (The International Workshop on Pretraining: Algorithms, Architectures, and Applications)
Abstract: Protein is linked to almost every life process. Therefore, analyzing the biological structure and properties of protein sequences is critical to the exploration of life, as well as disease detection and drug discovery. Traditional protein analysis methods tend to be labor-intensive and time-consuming. The emergence of deep learning models makes modeling data patterns in large quantities of data possible. Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets, e.g., using long short-term memory and convolutional neural networks for protein sequence classification. After millions of years of evolution, evolutionary information is encoded in protein sequences. Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences, encoding protein biology information in representations. Significant improvements are observed in both token-level and sequence-level tasks, demonstrating that our large-scale model can accurately capture evolutionary information from pretraining on evolutionary-scale individual sequences. Our code and model are available at https://github.com/THUDM/ProteinLM.
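Since the model treats a protein as a sequence of amino-acid tokens, a typical token-level use is masked-residue prediction. A hedged sketch of that pattern with the HuggingFace masked-LM API (the checkpoint path and the space-separated amino-acid tokenization are assumptions; see the ProteinLM repository above for the real interface):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# hypothetical checkpoint path; see https://github.com/THUDM/ProteinLM
tokenizer = AutoTokenizer.from_pretrained("path/to/proteinlm")
model = AutoModelForMaskedLM.from_pretrained("path/to/proteinlm").eval()

def predict_masked_residue(sequence: str, position: int) -> str:
    """Mask one residue of a space-separated amino-acid sequence, predict it."""
    tokens = sequence.split()
    tokens[position] = tokenizer.mask_token
    inputs = tokenizer(" ".join(tokens), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_idx = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0, 0]
    return tokenizer.decode([logits[0, mask_idx].argmax().item()])

# e.g. a protein fragment, one letter per residue
print(predict_masked_residue("M K T A Y I A K Q R", position=3))
```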
Others (5 papers)
【1】 Combining speakers of multiple languages to improve quality of neural voices
Link: https://arxiv.org/abs/2108.07737
Authors: Javier Latorre, Charlotte Bailleul, Tuuli Morrill, Alistair Conkie, Yannis Stylianou
Affiliations: Apple
Note: 6 pages, 3 figures. Accepted to 11th Speech Synthesis Workshop, SSW11 (this https URL)
Abstract: In this work, we explore multiple architectures and training procedures for developing a multi-speaker and multi-lingual neural TTS system, with the goals of a) improving the quality when the available data in the target language is limited and b) enabling cross-lingual synthesis. We report results from a large experiment using 30 speakers in 8 different languages across 15 different locales. The system is trained on the same amount of data per speaker. Compared to a single-speaker model, when the suggested system is fine-tuned to a speaker, it produces significantly better quality in most of the cases, while it uses less than 40% of the speaker's data used to build the single-speaker model. In cross-lingual synthesis, on average, the generated quality is within 80% of native single-speaker models, in terms of Mean Opinion Score.
【2】 On Incorrectness Logic and Kleene Algebra With Top and Tests
Link: https://arxiv.org/abs/2108.07707
Authors: Cheng Zhang, Arthur Azevedo de Amorim, Marco Gaboardi
Note: Submitted to POPL22 for review
Abstract: Kleene algebra with tests (KAT) is a foundational equational framework for reasoning about programs, which has found applications in program transformations, networking and compiler optimizations, among many other areas. In his seminal work, Kozen proved that KAT subsumes propositional Hoare logic, showing that one can reason about the (partial) correctness of while programs by means of the equational theory of KAT. In this work, we investigate the support that KAT provides for reasoning about incorrectness, instead, as embodied by O'Hearn's recently proposed incorrectness logic. We show that KAT cannot directly express incorrectness logic. The main reason for this limitation can be traced to the fact that KAT cannot express explicitly the notion of codomain, which is essential to express incorrectness triples. To address this issue, we study Kleene algebra with Top and Tests (TopKAT), an extension of KAT with a top element. We show that TopKAT is powerful enough to express a codomain operation, to express incorrectness triples, and to prove all the rules of incorrectness logic sound. This shows that one can reason about the incorrectness of while-like programs by means of the equational theory of TopKAT.
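To make the codomain point concrete: in the relational model, where tests are sub-identity relations and ⊤ is the full relation, a test t lies below the codomain of a program p exactly when t ≤ ⊤ · p, and this yields a direct encoding of incorrectness triples. A sketch of that encoding, hedged as our reading of the abstract rather than the paper's exact formulation:

```latex
% "every state satisfying t' is reachable through p from a state
% satisfying t", i.e. t' \subseteq \mathrm{codom}(t\,;\,p), becomes
% a single TopKAT inequation:
[t]\; p\; [t'] \quad\Longleftrightarrow\quad t' \le \top \cdot t \cdot p
```

Plain KAT has no ⊤, which is one way to see why it cannot state the right-hand side and hence cannot express incorrectness triples directly.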
【3】 ACM-CR: A Manually Annotated Test Collection for Citation Recommendation
Link: https://arxiv.org/abs/2108.07571
Authors: Florian Boudin
Affiliations: LS2N, Université de Nantes, Nantes, France
Note: Accepted at JCDL 2021
Abstract: Citation recommendation is intended to assist researchers in the process of searching for relevant papers to cite, by recommending appropriate citations for a given input text. Existing test collections for this task are noisy and unreliable since they are built automatically from parsed PDF papers. In this paper, we present our ongoing effort at creating a publicly available, manually annotated test collection for citation recommendation. We also conduct a series of experiments to evaluate the effectiveness of content-based baseline models on the test collection, providing results for future work to improve upon. Our test collection and code to replicate the experiments are available at https://github.com/boudinfl/acm-cr
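A standard content-based baseline for citation recommendation is ranking candidate papers by BM25 similarity between the citing context and the papers' titles or abstracts. A minimal sketch with the rank_bm25 package (the toy corpus is illustrative; the paper's own baselines may differ):

```python
from rank_bm25 import BM25Okapi

papers = [
    "Attention is all you need transformer architecture",
    "BERT pre-training of deep bidirectional transformers",
    "Efficient estimation of word representations word2vec",
]
bm25 = BM25Okapi([p.lower().split() for p in papers])

def recommend(context: str, k: int = 2) -> list:
    """Rank candidate papers by BM25 score against the citing context."""
    scores = bm25.get_scores(context.lower().split())
    ranked = sorted(zip(scores, papers), reverse=True)
    return [title for _, title in ranked[:k]]

print(recommend("we fine-tune a bidirectional transformer encoder"))
```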
【4】 SPMoE: Generate Multiple Pattern-Aware Outputs with Sparse Pattern Mixture of Expert
Link: https://arxiv.org/abs/2108.07535
Authors: Shaobo Cui, Xintong Bao, Xuming Lin, Zhongzhou Zhao, Ji Zhang, Wei Zhou, Haiqing Chen
Affiliations: DAMO Academy, Alibaba Group
Abstract: Many generation tasks follow a one-to-many mapping relationship: each input can be associated with multiple outputs. Existing methods like the Conditional Variational AutoEncoder (CVAE) employ a latent variable to model this one-to-many relationship. However, this high-dimensional and dense latent variable lacks explainability and usually leads to poor and uncontrollable generations. In this paper, we innovatively introduce the linguistic concept of pattern to decompose the one-to-many mapping into multiple one-to-one mappings, and further propose a model named Sparse Pattern Mixture of Experts (SPMoE). Each one-to-one mapping is associated with a conditional generation pattern and is modeled with an expert in SPMoE. To ensure each language pattern can be exclusively handled by an expert model, for better explainability and diversity, a sparse mechanism is employed to coordinate all the expert models in SPMoE. We assess the performance of SPMoE on the paraphrase generation task, and the experimental results show that SPMoE achieves a good balance in terms of quality, pattern-level diversity, and corpus-level diversity.
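The sparse mechanism that coordinates the experts can be pictured as a top-1 gate that routes each input to a single pattern expert. A minimal PyTorch sketch of such sparse gating (an illustration of the general mechanism, not the paper's exact model):

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Route each input to its top-1 expert; each expert models one pattern."""
    def __init__(self, dim: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, dim)
        weights = self.gate(x).softmax(dim=-1)
        top = weights.argmax(dim=-1)  # hard top-1 routing: one expert per input
        out = torch.stack([self.experts[int(e)](xi) for xi, e in zip(x, top)])
        # scale by the gate weight so the gate still receives gradient
        return out * weights.gather(1, top.unsqueeze(1))

moe = SparseMoE(dim=16, n_experts=4)
print(moe(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```

Because each input activates exactly one expert, each expert sees a coherent slice of the data, which is what lets an expert specialize in a single generation pattern.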
【5】 IsoScore: Measuring the Uniformity of Vector Space Utilization
Link: https://arxiv.org/abs/2108.07344
Authors: William Rudman, Nate Gillman, Taylor Rayne, Carsten Eickhoff
Affiliations: Department of Computer Science, Brown University; Department of Mathematics, Brown University; Quest University
Abstract: The recent success of distributed word representations has led to an increased interest in analyzing the properties of their spatial distribution. Current metrics suggest that contextualized word embedding models do not uniformly utilize all dimensions when embedding tokens in vector space. Here we argue that existing metrics are fragile and tend to obfuscate the true spatial distribution of point clouds. To ameliorate this issue, we propose IsoScore: a novel metric which quantifies the degree to which a point cloud uniformly utilizes the ambient vector space. We demonstrate that IsoScore has several desirable properties, such as mean invariance and direct correspondence to the number of dimensions used, which existing scores do not possess. Furthermore, IsoScore is conceptually intuitive and computationally efficient, making it well suited for analyzing the distribution of point clouds in arbitrary vector spaces, not necessarily limited to those of word embeddings alone. Additionally, we use IsoScore to demonstrate that a number of recent conclusions in the NLP literature that have been derived using brittle metrics of spatial distribution, such as average cosine similarity, may be incomplete or altogether inaccurate.
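For context on the metrics the paper critiques: a common isotropy proxy in the NLP literature is the average pairwise cosine similarity of embeddings, with values near zero taken to suggest isotropy. A minimal sketch of that brittle baseline (this is the baseline, not IsoScore itself), which also illustrates why mean invariance matters:

```python
import numpy as np

def avg_cosine_similarity(X: np.ndarray) -> float:
    """Mean pairwise cosine similarity of row vectors (excluding self-pairs)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    n = len(X)
    return (sims.sum() - n) / (n * (n - 1))  # drop the diagonal of ones

rng = np.random.default_rng(0)
isotropic = rng.normal(size=(500, 50))
shifted = isotropic + 5.0  # add a common non-zero mean to every vector
print(avg_cosine_similarity(isotropic))  # near 0: looks isotropic
print(avg_cosine_similarity(shifted))    # near 1, although the cloud's shape
                                         # is unchanged -- the metric is fooled
```

Translating the cloud flips this metric's verdict even though space utilization is identical, which is exactly the kind of fragility the mean-invariant IsoScore is designed to avoid.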