cs.CL: 22 papers today
Transformer (1 paper)
【1】 raceBERT -- A Transformer-based Model for Predicting Race from Names. Link: https://arxiv.org/abs/2112.03807
Authors: Prasanna Parasurama. Affiliations: New York University. Comments: See this http URL. Abstract: This paper presents raceBERT -- a transformer-based model for predicting race from character sequences in names, and an accompanying python package. Using a transformer-based model trained on a U.S. Florida voter registration dataset, the model predicts the likelihood of a name belonging to 5 U.S. census race categories (White, Black, Hispanic, Asian & Pacific Islander, American Indian & Alaskan Native). I build on Sood and Laohaprapanon (2018) by replacing their LSTM model with transformer-based models (a pre-trained BERT model, and a roBERTa model trained from scratch), and compare the results. To the best of my knowledge, raceBERT achieves state-of-the-art results in race prediction using names, with an average f1-score of 0.86 -- a 4.1% improvement over the previous state-of-the-art, and improvements between 15-17% for non-white names.
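A minimal inference sketch for a classifier of this kind, using the HuggingFace pipeline API. The checkpoint identifier is an assumption; substitute whatever the paper's python package ships with:

```python
from transformers import pipeline

# Hypothetical checkpoint id -- replace with the one distributed with raceBERT.
classifier = pipeline("text-classification", model="pparasurama/raceBERT")

# The model reads the character sequence of a name and returns scores over
# the 5 U.S. census race categories.
print(classifier("john smith", top_k=None))
```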
BERT (1 paper)
【1】 Adapting BERT for Continual Learning of a Sequence of Aspect Sentiment Classification Tasks. Link: https://arxiv.org/abs/2112.03271
Authors: Zixuan Ke, Hu Xu, Bing Liu. Affiliations: Department of Computer Science, University of Illinois at Chicago; Facebook AI Research. Abstract: This paper studies continual learning (CL) of a sequence of aspect sentiment classification (ASC) tasks. Although some CL techniques have been proposed for document sentiment classification, we are not aware of any CL work on ASC. A CL system that incrementally learns a sequence of ASC tasks should address the following two issues: (1) transfer knowledge learned from previous tasks to the new task to help it learn a better model, and (2) maintain the performance of the models for previous tasks so that they are not forgotten. This paper proposes a novel capsule network based model called B-CL to address these issues. B-CL markedly improves the ASC performance on both the new task and the old tasks via forward and backward knowledge transfer. The effectiveness of B-CL is demonstrated through extensive experiments.
QA|VQA|Question Answering|Dialogue (2 papers)
【1】 Automated Story Generation as Question-Answering. Link: https://arxiv.org/abs/2112.03808
Authors: Louis Castricato, Spencer Frazier, Jonathan Balloch, Nitya Tarakad, Mark Riedl. Abstract: Neural language model-based approaches to automated story generation suffer from two important limitations. First, language model-based story generators generally do not work toward a given goal or ending. Second, they often lose coherence as the story gets longer. We propose a novel approach to automated story generation that treats the problem as one of generative question-answering. Our proposed story generation system starts with sentences encapsulating the final event of the story. The system then iteratively (1) analyzes the text describing the most recent event, (2) generates a question about "why" a character is doing the thing they are doing in the event, and then (3) attempts to generate another, preceding event that answers this question.
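A minimal sketch of the backward-chaining loop described in the abstract, assuming two hypothetical generation functions (`ask_why`, `answer_with_event`) that would in practice wrap fine-tuned language models:

```python
def generate_story_backwards(final_event, ask_why, answer_with_event, n_events=5):
    # Start from the ending and grow the story toward its beginning.
    events = [final_event]
    for _ in range(n_events - 1):
        question = ask_why(events[0])           # e.g. "Why did the knight draw his sword?"
        previous = answer_with_event(question)  # a preceding event answering the question
        events.insert(0, previous)
    return events                               # events in chronological order
```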
【2】 Question Answering Survey: Directions, Challenges, Datasets, Evaluation Matrices. Link: https://arxiv.org/abs/2112.03572
Authors: Hariom A. Pandya, Brijesh S. Bhatt. Affiliations: Computer Engineering Department, Dharmsinh Desai University, Nadiad, Gujarat, India. Abstract: The usage and amount of information available on the internet have increased over the past decade. This digitization leads to the need for automated answering systems that extract fruitful information from redundant and transitional knowledge sources. Such systems are designed to deliver the most pertinent answer from this giant knowledge source for a user query using natural language understanding (NLU), and thus depend eminently on the question-answering (QA) field. Question answering involves, but is not limited to, steps such as mapping the user question to a pertinent query, retrieving relevant information, and finding the best suitable answer from the retrieved information. Recent improvements in deep learning models evince compelling performance gains on all these tasks. In this review, the research directions of the QA field are analyzed based on the type of question, answer type, source of evidence-answer, and modeling approach. This detailing is followed by open challenges of the field, such as automatic question generation, similarity detection, and low resource availability for a language. In the end, a survey of available datasets and evaluation measures is presented.
Semantic Analysis (2 papers)
【1】 Change Summarization of Diachronic Scholarly Paper Collections by Semantic Evolution Analysis. Link: https://arxiv.org/abs/2112.03634
Authors: Naman Paharia, Muhammad Syafiq Mohd Pozi, Adam Jatowt. Affiliations: IIT Kharagpur, Kharagpur, India; Universiti Utara Malaysia; Universiti Kebangsaan Malaysia; University of Innsbruck, Innsbruck, Tirol, Austria. Comments: 4 pages, JCDL-2021. Abstract: The amount of scholarly data has been increasing dramatically over the last years. For newcomers to a particular science domain (e.g., IR, physics, NLP) it is often difficult to spot larger trends and to position the latest research in the context of prior scientific achievements and breakthroughs. Similarly, researchers in the history of science are interested in tools that allow them to analyze and visualize changes in particular scientific domains. Temporal summarization and related methods should be then useful for making sense of large volumes of scientific discourse data aggregated over time. We demonstrate a novel approach to analyze the collections of research papers published over longer time periods to provide a high-level overview of important semantic changes that occurred over the progress of time. Our approach is based on comparing word semantic representations over time and aims to support users in a better understanding of large domain-focused archives of scholarly publications. As an example dataset we use the ACL Anthology Reference Corpus that spans from 1979 to 2015 and contains 22,878 scholarly articles.
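A rough sketch of the underlying comparison of word representations over time, assuming `sentences_by_period` maps each period label to tokenized sentences from that period's papers; per-period word2vec spaces are aligned with orthogonal Procrustes before measuring drift (the word is assumed to occur in every period's vocabulary):

```python
import numpy as np
from gensim.models import Word2Vec

def semantic_drift(sentences_by_period, word):
    models = {p: Word2Vec(s, vector_size=100, min_count=5).wv
              for p, s in sentences_by_period.items()}
    periods = sorted(models)
    base, drift = models[periods[0]], {}
    for p in periods[1:]:
        cur = models[p]
        shared = [w for w in base.index_to_key if w in cur.key_to_index]
        # Orthogonal Procrustes: rotate the later space onto the base space so
        # that independently trained embeddings become comparable.
        U, _, Vt = np.linalg.svd(cur[shared].T @ base[shared])
        v0, v1 = base[word], cur[word] @ (U @ Vt)
        drift[p] = 1 - v0 @ v1 / (np.linalg.norm(v0) * np.linalg.norm(v1))
    return drift  # cosine distance per period: larger = more semantic change
```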
【2】 Parsing with Pretrained Language Models, Multiple Datasets, and Dataset Embeddings. Link: https://arxiv.org/abs/2112.03625
Authors: Rob van der Goot, Miryam de Lhoneux. Affiliations: IT University of Copenhagen; Uppsala University; KU Leuven. Comments: Accepted to TLT at SyntaxFest 2021. Abstract: With an increase of dataset availability, the potential for learning from a variety of data sources has increased. One particular method to improve learning from multiple data sources is to embed the data source during training. This allows the model to learn generalizable features as well as distinguishing features between datasets. However, these dataset embeddings have mostly been used before contextualized transformer-based embeddings were introduced in the field of Natural Language Processing. In this work, we compare two methods to embed datasets in a transformer-based multilingual dependency parser, and perform an extensive evaluation. We show that: 1) embedding the dataset is still beneficial with these models; 2) performance increases are highest when embedding the dataset at the encoder level; 3) unsurprisingly, we confirm that performance increases are highest for small datasets and datasets with a low baseline score; and 4) training on the combination of all datasets performs similarly to designing smaller clusters based on language-relatedness.
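A minimal sketch of the encoder-level variant (finding 2), assuming a HuggingFace-style multilingual encoder: one learned vector per dataset is added to every contextualized token representation before parsing:

```python
import torch.nn as nn

class EncoderWithDatasetEmbedding(nn.Module):
    def __init__(self, encoder, hidden_dim, n_datasets):
        super().__init__()
        self.encoder = encoder                        # e.g. a multilingual transformer
        self.dataset_emb = nn.Embedding(n_datasets, hidden_dim)

    def forward(self, input_ids, attention_mask, dataset_id):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        # Broadcast one learned dataset vector over all token positions.
        return hidden + self.dataset_emb(dataset_id)[:, None, :]
```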
Graph|Knowledge Graph|Knowledge (2 papers)
【1】 GKS: Graph-based Knowledge Selector for Task-oriented Dialog System. Link: https://arxiv.org/abs/2112.03719
Authors: Jen-Chieh Yang, Jia-Yan Wu, Sung-Ping Chang, Ya-Chieh Huang. Affiliations: Academia Sinica; National Taiwan University; Columbia University. Comments: Accepted to The Tenth Dialog System Technology Challenge workshop at the Association for the Advancement of Artificial Intelligence (AAAI) 2022. Abstract: In previous research, knowledge selection tasks mostly rely on language model-based methods or knowledge ranking. However, approaches that simply rely on the language model take all knowledge as sequential input, even though knowledge does not contain sequential information in most circumstances. On the other hand, knowledge ranking methods leverage the dialog history and each given knowledge snippet, but not the relations between pieces of knowledge. In the 10th Dialog System Technology Challenges (DSTC 10), we participated in the second track, Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations. To deal with the problems mentioned above, we modified training methods based on SOTA models for the first and third sub-tasks, and propose the Graph-Knowledge Selector (GKS), a graph-attention model incorporated with a language model, for the knowledge-selection sub-task two. GKS makes knowledge selection decisions in the dialog by simultaneously considering each knowledge embedding generated from the language model, without sequential features. GKS also leverages considerable knowledge in decision-making, taking relations across knowledge into account as part of the selection process. GKS outperforms several SOTA models proposed on the knowledge selection dataset from the 9th Dialog System Technology Challenge (DSTC9).
【2】 Improving Neural Cross-Lingual Summarization via Employing Optimal Transport Distance for Knowledge Distillation. Link: https://arxiv.org/abs/2112.03473
Authors: Thong Nguyen, Luu Anh Tuan. Affiliations: VinAI Research, Vietnam; Nanyang Technological University, Singapore. Comments: Accepted by the 36th AAAI Conference on Artificial Intelligence (AAAI 2022). Abstract: Current state-of-the-art cross-lingual summarization models employ a multi-task learning paradigm, which works on a shared vocabulary module and relies on the self-attention mechanism to attend among tokens in two languages. However, correlation learned by self-attention is often loose and implicit, inefficient in capturing crucial cross-lingual representations between languages. The matter worsens when performing on languages with separate morphological or structural features, making the cross-lingual alignment more challenging and resulting in a performance drop. To overcome this problem, we propose a novel knowledge-distillation-based framework for cross-lingual summarization, seeking to explicitly construct cross-lingual correlation by distilling the knowledge of a monolingual summarization teacher into a cross-lingual summarization student. Since the representations of the teacher and the student lie in two different vector spaces, we further propose a knowledge distillation loss using Sinkhorn Divergence, an Optimal-Transport distance, to estimate the discrepancy between those teacher and student representations. Due to the intuitively geometric nature of Sinkhorn Divergence, the student model can productively learn to align its produced cross-lingual hidden states with monolingual hidden states, hence leading to a strong correlation between distant languages. Experiments on cross-lingual summarization datasets in pairs of distant languages demonstrate that our method outperforms state-of-the-art models under both high and low-resourced settings.
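A minimal PyTorch sketch of a debiased Sinkhorn divergence between teacher and student hidden states, treated here as uniform-weight point clouds; `eps` is the entropic-regularization strength (a sketch of the general technique, not the paper's exact implementation):

```python
import math
import torch

def sinkhorn_cost(x, y, eps=0.1, n_iters=50):
    # Entropic-regularized OT cost between point clouds x (n, d) and y (m, d),
    # uniform weights, computed with log-domain Sinkhorn iterations.
    C = torch.cdist(x, y) ** 2
    n, m = C.shape
    log_a = torch.full((n, 1), -math.log(n), device=C.device)
    log_b = torch.full((1, m), -math.log(m), device=C.device)
    f = torch.zeros(n, 1, device=C.device)
    g = torch.zeros(1, m, device=C.device)
    for _ in range(n_iters):
        # Block-coordinate updates of the dual potentials.
        f = -eps * torch.logsumexp((g - C) / eps + log_b, dim=1, keepdim=True)
        g = -eps * torch.logsumexp((f - C) / eps + log_a, dim=0, keepdim=True)
    P = torch.exp((f + g - C) / eps + log_a + log_b)   # transport plan
    return (P * C).sum()

def sinkhorn_divergence(student_h, teacher_h, eps=0.1):
    # Debiased divergence: S(a, b) = OT(a, b) - (OT(a, a) + OT(b, b)) / 2.
    return (sinkhorn_cost(student_h, teacher_h, eps)
            - 0.5 * sinkhorn_cost(student_h, student_h, eps)
            - 0.5 * sinkhorn_cost(teacher_h, teacher_h, eps))
```

In training, a term like this would be added to the usual summarization loss so the student's cross-lingual hidden states move toward the teacher's monolingual ones.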
Reasoning|Analysis|Understanding|Explanation (1 paper)
【1】 Scaling Structured Inference with Randomization. Link: https://arxiv.org/abs/2112.03638
Authors: Yao Fu, Mirella Lapata. Affiliations: Institute for Language, Cognition and Computation, University of Edinburgh. Comments: Preprint. Abstract: The scale of the state space of discrete graphical models is crucial for model capacity in the era of deep learning. Existing dynamic programming (DP) based inference typically works with a small number of states (usually less than hundreds). In this work, we propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states. Our method is widely applicable to classical DP-based inference (partition, marginal, reparameterization, entropy, etc.) and different graph structures (chains, trees, and more general hypergraphs). It is also compatible with automatic differentiation, so it can be integrated with neural networks seamlessly and learned with gradient-based optimizers. Our core technique is randomization, which is to restrict and reweight DP on a small selected subset of nodes, leading to computation reduction by orders of magnitude. We further achieve low bias and variance with Rao-Blackwellization and importance sampling. Experiments on different inferences over different graphs demonstrate the accuracy and efficiency of our methods. Furthermore, when using RDP to train a scaled structured VAE, it outperforms baselines in terms of test likelihood and successfully prevents posterior collapse.
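A toy instance of the core randomization idea, for the partition function of a chain model: each DP step sums over a uniformly sampled subset of states, reweighted by the inverse inclusion probability, giving an unbiased estimate at a fraction of the cost (a sketch only; the paper additionally uses Rao-Blackwellization and importance sampling to reduce variance):

```python
import numpy as np

def forward_exact(init, potentials):
    # init: (S,) non-negative initial scores; potentials: list of (S, S) matrices.
    alpha = init.copy()
    for M in potentials:
        alpha = alpha @ M            # alpha_j = sum_i alpha_i * M[i, j]
    return alpha.sum()               # partition function Z

def forward_randomized(init, potentials, k, rng):
    # Restrict each sum to k uniformly sampled states and reweight by S/k.
    # Independent subsamples across steps make the estimator unbiased.
    S = init.shape[0]
    alpha = init.copy()
    for M in potentials:
        idx = rng.choice(S, size=k, replace=False)
        alpha = (S / k) * alpha[idx] @ M[idx, :]
    return alpha.sum()

rng = np.random.default_rng(0)
S = 2000
init = rng.random(S)
potentials = [rng.random((S, S)) for _ in range(10)]
print(forward_exact(init, potentials))
print(np.mean([forward_randomized(init, potentials, 200, rng) for _ in range(20)]))
```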
GAN|Adversarial|Attack|Generation (1 paper)
【1】 Natural Answer Generation: From Factoid Answer to Full-length Answer using Grammar Correction. Link: https://arxiv.org/abs/2112.03849
Authors: Manas Jain, Sriparna Saha, Pushpak Bhattacharyya, Gladvin Chinnadurai, Manish Kumar Vatsa. Affiliations: Indian Institute of Technology Bombay, Mumbai; Indian Institute of Technology Patna; LG Soft India. Abstract: Question answering systems these days typically use template-based language generation. Though adequate for a domain-specific task, these systems are too restrictive and predefined for domain-independent systems. This paper proposes a system that outputs a full-length answer given a question and the extracted factoid answer (short spans such as named entities) as the input. Our system uses constituency and dependency parse trees of questions. A transformer-based grammar error correction model, GECToR (2020), is used as a post-processing step for better fluency. We compare our system with (i) Modified Pointer Generator (SOTA) and (ii) Fine-tuned DialoGPT for factoid questions. We also test our approach on existential (yes-no) questions with better results. Our model generates more accurate and fluent answers than the state-of-the-art (SOTA) approaches. The evaluation is done on the NewsQA and SQuAD datasets, with increments of 0.4 and 0.9 percentage points in ROUGE-1 score respectively. The inference time is also reduced by 85% compared to the SOTA. The improved datasets used for our evaluation will be released as part of the research contribution.
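A crude sketch of the parse-and-substitute idea, using spaCy: the wh-word in the question is replaced by the factoid answer to form a draft full-length answer, which would then be passed to a grammar-correction model such as GECToR for fluency (auxiliary inversion and the paper's other tree transformations are not handled here):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def full_length_answer(question, factoid):
    tokens = []
    for tok in nlp(question):
        if tok.tag_ in ("WP", "WP$", "WRB", "WDT"):  # who/whose/when/which...
            tokens.append(factoid)                   # substitute the factoid span
        elif tok.text == "?":
            tokens.append(".")
        else:
            tokens.append(tok.text)
    return " ".join(tokens)  # post-process with a GEC model for fluency

print(full_length_answer("Who wrote Hamlet?", "Shakespeare"))
# -> "Shakespeare wrote Hamlet ."
```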
Recognition/Classification (1 paper)
【1】 CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification. Link: https://arxiv.org/abs/2112.03562
Authors: Huidong Liu, Shaoyuan Xu, Jinmiao Fu, Yang Liu, Ning Xie, Chien-chih Wang, Bryan Wang, Yi Sun. Affiliations: Stony Brook University, Stony Brook, NY, USA; Amazon Inc., Seattle, WA, USA. Abstract: Modern Web systems such as social media and e-commerce contain rich contents expressed in images and text. Leveraging information from multi-modalities can improve the performance of machine learning tasks such as classification and recommendation. In this paper, we propose the Cross-Modality Attention Contrastive Language-Image Pre-training (CMA-CLIP), a new framework which unifies two types of cross-modality attentions, sequence-wise attention and modality-wise attention, to effectively fuse information from image and text pairs. The sequence-wise attention enables the framework to capture the fine-grained relationship between image patches and text tokens, while the modality-wise attention weighs each modality by its relevance to the downstream tasks. In addition, by adding task-specific modality-wise attentions and multilayer perceptrons, our proposed framework is capable of performing multi-task classification with multi-modalities. We conduct experiments on a Major Retail Website Product Attribute (MRWPA) dataset and two public datasets, Food101 and Fashion-Gen. The results show that CMA-CLIP outperforms the pre-trained and fine-tuned CLIP by an average of 11.9% in recall at the same level of precision on the MRWPA dataset for multi-task classification. It also surpasses the state-of-the-art method on the Fashion-Gen dataset by 5.5% in accuracy and achieves competitive performance on the Food101 dataset. Through detailed ablation studies, we further demonstrate the effectiveness of both cross-modality attention modules and our method's robustness against noise in image and text inputs, which is a common challenge in practice.
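A minimal sketch of the modality-wise attention component described above: each pooled modality embedding gets a learned relevance score, and the fused representation is their softmax-weighted sum:

```python
import torch
import torch.nn as nn

class ModalityWiseAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # one relevance scalar per modality

    def forward(self, image_emb, text_emb):
        # image_emb, text_emb: (batch, dim) pooled per-modality features
        stacked = torch.stack([image_emb, text_emb], dim=1)  # (batch, 2, dim)
        weights = torch.softmax(self.score(stacked), dim=1)  # (batch, 2, 1)
        return (weights * stacked).sum(dim=1)                # fused (batch, dim)
```

A task-specific copy of this module plus an MLP head per task would give the multi-task variant described in the abstract.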
Word2Vec|Text|Words (2 papers)
【1】 Multi-speaker Emotional Text-to-speech Synthesizer. Link: https://arxiv.org/abs/2112.03557
Authors: Sungjae Cho, Soo-Young Lee. Affiliations: Korea Institute of Science and Technology, Republic of Korea; Korea Advanced Institute of Science and Technology, Republic of Korea. Abstract: We present a methodology to train our multi-speaker emotional text-to-speech synthesizer that can express speech for 10 speakers' 7 different emotions. All silences from audio samples are removed prior to learning. This results in fast learning by our model. Curriculum learning is applied to train our model efficiently. Our model is first trained with a large single-speaker neutral dataset, and then trained with neutral speech from all speakers. Finally, our model is trained using datasets of emotional speech from all speakers. In each stage, training samples of each speaker-emotion pair have equal probability to appear in mini-batches. Through this procedure, our model can synthesize speech for all targeted speakers and emotions. Our synthesized audio sets are available on our web page.
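A minimal sketch of the balanced sampling scheme: mini-batches are drawn so that every speaker-emotion pair is equally likely, regardless of how many examples each pair has:

```python
import random
from collections import defaultdict

def sample_batch(examples, batch_size, rng=random):
    # examples: list of (speaker, emotion, text, audio_path) tuples
    by_pair = defaultdict(list)
    for ex in examples:
        by_pair[(ex[0], ex[1])].append(ex)
    pairs = list(by_pair)
    # First pick a pair uniformly, then an example uniformly within the pair.
    return [rng.choice(by_pair[rng.choice(pairs)]) for _ in range(batch_size)]
```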
【2】 Ground-Truth, Whose Truth? -- Examining the Challenges with Annotating Toxic Text Datasets. Link: https://arxiv.org/abs/2112.03529
Authors: Kofi Arhin, Ioana Baldini, Dennis Wei, Karthikeyan Natesan Ramamurthy, Moninder Singh. Affiliations: Lally School of Management, Rensselaer Polytechnic Institute. Comments: 15 pages. Abstract: The use of machine learning (ML)-based language models (LMs) to monitor content online is on the rise. For toxic text identification, task-specific fine-tuning of these models is performed using datasets labeled by annotators who provide ground-truth labels in an effort to distinguish between offensive and normal content. These projects have led to the development, improvement, and expansion of large datasets over time, and have contributed immensely to research on natural language. Despite the achievements, existing evidence suggests that ML models built on these datasets do not always result in desirable outcomes. Therefore, using a design science research (DSR) approach, this study examines selected toxic text datasets with the goal of shedding light on some of the inherent issues and contributing to discussions on navigating these challenges for existing and future projects. To achieve the goal of the study, we re-annotate samples from three toxic text datasets and find that a multi-label approach to annotating toxic text samples can help to improve dataset quality. While this approach may not improve the traditional metric of inter-annotator agreement, it may better capture dependence on context and diversity in annotators. We discuss the implications of these results for both theory and practice.
Other Neural Networks|Deep Learning|Models|Modeling (1 paper)
【1】 A deep language model to predict metabolic network equilibria. Link: https://arxiv.org/abs/2112.03588
Authors: François Charton, Amaury Hayat, Sean T. McQuade, Nathaniel J. Merrill, Benedetto Piccoli. Affiliations: Facebook AI Research; CERMICS, Ecole des Ponts ParisTech, Champs-sur-Marne, France; Department of Mathematical Sciences and Center for Computational and Integrative Biology, Rutgers University-Camden, Camden, NJ, USA. Abstract: We show that deep learning models, and especially architectures like the Transformer, originally intended for natural language, can be trained on randomly generated datasets to predict to very high accuracy both the qualitative and quantitative features of metabolic networks. Using standard mathematical techniques, we create large sets (40 million elements) of random networks that can be used to train our models. These trained models can predict network equilibrium on random graphs in more than 99% of cases. They can also generalize to graphs with different structure than those encountered at training. Finally, they can predict almost perfectly the equilibria of a small set of known biological networks. Our approach is both very economical in experimental data and uses only a small and shallow deep-learning model, far from the large architectures commonly used in machine translation. Such results pave the way for larger use of deep learning models for problems related to biological networks in key areas such as quantitative systems pharmacology, systems biology, and synthetic biology.
Others (8 papers)
【1】 Reducing Target Group Bias in Hate Speech Detectors. Link: https://arxiv.org/abs/2112.03858
Authors: Darsh J Shah, Sinong Wang, Han Fang, Hao Ma, Luke Zettlemoyer. Affiliations: Meta AI Research. Abstract: The ubiquity of offensive and hateful content on online fora necessitates automatic solutions that detect such content competently across target groups. In this paper we show that text classification models trained on large publicly available datasets, despite having a high overall performance, may significantly under-perform on several protected groups. On the Vidgen et al. (2020) dataset, we find the accuracy to be 37% lower on an under-annotated Black Women target group and 12% lower on Immigrants, where hate speech involves a distinct style. To address this, we propose to perform token-level hate sense disambiguation, and to utilize tokens' hate sense representations for detection, modeling more general signals. On two publicly available datasets, we observe that the variance in model accuracy across target groups drops by at least 30%, improving the average target group performance by 4% and worst case performance by 13%.
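A small sketch of the headline metric: per-target-group accuracy and its variance across groups, the quantity reported above to drop by at least 30%:

```python
import numpy as np

def per_group_accuracy(y_true, y_pred, groups):
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    accs = {g: (y_pred[groups == g] == y_true[groups == g]).mean()
            for g in np.unique(groups)}
    return accs, np.var(list(accs.values()))  # group accuracies and their variance
```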
【2】 Grounded Language-Image Pre-training. Link: https://arxiv.org/abs/2112.03857
Authors: Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao. Affiliations: UCLA; Microsoft Research; University of Washington; University of Wisconsin-Madison; Microsoft Cloud and AI; International Digital Economy Academy. Comments: Code will be released at this https URL. Abstract: This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuned on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals with a fully-supervised Dynamic Head. Code will be released at https://github.com/microsoft/GLIP.
【3】 A pragmatic account of the weak evidence effect. Link: https://arxiv.org/abs/2112.03799
Authors: Samuel A. Barnett, Robert D. Hawkins, Thomas L. Griffiths. Affiliations: Department of Computer Science, Princeton University; Department of Psychology, Princeton University. Comments: Under review. Abstract: Language is not only used to inform. We often seek to persuade by arguing in favor of a particular view. Persuasion raises a number of challenges for classical accounts of belief updating, as information cannot be taken at face value. How should listeners account for a speaker's "hidden agenda" when incorporating new information? Here, we extend recent probabilistic models of recursive social reasoning to allow for persuasive goals and show that our model provides a new pragmatic explanation for why weakly favorable arguments may backfire, a phenomenon known as the weak evidence effect. Critically, our model predicts a relationship between belief updating and speaker expectations: weak evidence should only backfire when speakers are expected to act under persuasive goals, implying the absence of stronger evidence. We introduce a simple experimental paradigm called the Stick Contest to measure the extent to which the weak evidence effect depends on speaker expectations, and show that a pragmatic listener model accounts for the empirical data better than alternative models. Our findings suggest potential avenues for rational models of social reasoning to further illuminate decision-making phenomena.
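A toy version of the model's core prediction, as a small rational-speech-acts sketch (the numbers and utterance space are illustrative, not the paper's): worlds are the number of genuinely favorable facts, a truthful persuasive speaker cites as many as they can, and the pragmatic listener therefore reads weak evidence as implying the absence of stronger evidence:

```python
import numpy as np

worlds = np.arange(4)            # 0..3 favorable facts, uniform prior
prior = np.ones(4) / 4

def literal_listener(k):         # L0: condition on "k facts" being citable
    p = prior * (worlds >= k)
    return p / p.sum()

def persuasive_speaker(world, alpha=5.0):
    # S1: softmax utility = how favorable the literal listener's belief becomes;
    # citing more facts is better, but only truthful utterances are allowed.
    utils = np.array([literal_listener(k) @ worlds if k <= world else -np.inf
                      for k in worlds])
    p = np.exp(alpha * (utils - utils[world]))
    return p / p.sum()

def pragmatic_listener(k):       # L1: Bayes over worlds given a persuasive speaker
    p = prior * np.array([persuasive_speaker(w)[k] for w in worlds])
    return p / p.sum()

# Weak evidence backfires: hearing "1 fact" lowers the inferred strength
# relative to its literal reading.
print(literal_listener(1) @ worlds)    # 2.0
print(pragmatic_listener(1) @ worlds)  # ~1.1
```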
【4】 UCD-CS at TREC 2021 Incident Streams Track. Link: https://arxiv.org/abs/2112.03737
Authors: Congcong Wang, David Lillis. Affiliations: School of Computer Science, University College Dublin, Dublin, Ireland. Abstract: In recent years, the task of mining important information from social media posts during crises has become a focus of research for the purposes of assisting emergency response (ES). The TREC Incident Streams (IS) track is a research challenge organised for this purpose. The track asks participating systems to both classify a stream of crisis-related tweets into humanitarian aid related information types and estimate their importance regarding criticality. The former refers to a multi-label information type classification task and the latter refers to a priority estimation task. In this paper, we report on the participation of the University College Dublin School of Computer Science (UCD-CS) in TREC-IS 2021. We explored a variety of approaches, including simple machine learning algorithms, multi-task learning techniques, text augmentation, and ensemble approaches. The official evaluation results indicate that our runs achieve the highest scores in many metrics. To aid reproducibility, our code is publicly available at https://github.com/wangcongcong123/crisis-mtl.
【5】 UNITER-Based Situated Coreference Resolution with Rich Multimodal Input. Link: https://arxiv.org/abs/2112.03521
Authors: Yichen Huang, Yuchen Wang, Yik-Cheung Tam. Affiliations: New York University Shanghai. Abstract: We present our work on the multimodal coreference resolution task of the Situated and Interactive Multimodal Conversation 2.0 (SIMMC 2.0) dataset as a part of the tenth Dialog System Technology Challenge (DSTC10). We propose a UNITER-based model utilizing rich multimodal context such as textual dialog history, object knowledge base and visual dialog scenes to determine whether each object in the current scene is mentioned in the current dialog turn. Results show that the proposed approach outperforms the official DSTC10 baseline substantially, with the object F1 score boosted from 36.6% to 77.3% on the development set, demonstrating the effectiveness of the proposed object representations from rich multimodal input. Our model ranks second in the official evaluation on the object coreference resolution task with an F1 score of 73.3% after model ensembling.
【6】 Dataset Geography: Mapping Language Data to Language Users. Link: https://arxiv.org/abs/2112.03497
Authors: Fahim Faisal, Yinkai Wang, Antonios Anastasopoulos. Affiliations: Department of Computer Science, George Mason University, USA. Abstract: As language technologies become more ubiquitous, there are increasing efforts towards expanding the language diversity and coverage of natural language processing (NLP) systems. Arguably, the most important factor influencing the quality of modern NLP systems is data availability. In this work, we study the geographical representativeness of NLP datasets, aiming to quantify if and by how much do NLP datasets match the expected needs of the language speakers. In doing so, we use entity recognition and linking systems, also making important observations about their cross-lingual consistency and giving suggestions for more robust evaluation. Last, we explore some geographical and economic factors that may explain the observed dataset distributions. Code and data are available here: https://github.com/ffaisal93/dataset_geography. Additional visualizations are available here: https://nlp.cs.gmu.edu/project/datasetmaps/.
【7】 JUSTICE: A Benchmark Dataset for Supreme Court's Judgment Prediction. Link: https://arxiv.org/abs/2112.03414
Authors: Mohammad Alali, Shaayan Syed, Mohammed Alsayed, Smit Patel, Hemanth Bodala. Affiliations: University of Southern California. Comments: 6 pages, 6 figures. Abstract: Artificial intelligence is being utilized in many domains as of late, and the legal system is no exception. However, as it stands now, the number of well-annotated datasets pertaining to legal documents from the Supreme Court of the United States (SCOTUS) is very limited for public use. Even though the Supreme Court rulings are public domain knowledge, trying to do meaningful work with them becomes a much greater task due to the need to manually gather and process that data from scratch each time. Hence, our goal is to create a high-quality dataset of SCOTUS court cases so that they may be readily used in natural language processing (NLP) research and other data-driven applications. Additionally, recent advances in NLP provide us with the tools to build predictive models that can be used to reveal patterns that influence court decisions. By using advanced NLP algorithms to analyze previous court cases, the trained models are able to predict and classify a court's judgment given the case's facts from the plaintiff and the defendant in textual format; in other words, the model is emulating a human jury by generating a final verdict.
【8】 EmTract: Investor Emotions and Market Behavior. Link: https://arxiv.org/abs/2112.03868
Authors: Domonkos Vamossy, Rolf Skog. Abstract: We develop a tool that extracts emotions from social media text data. Our methodology has three main advantages. First, it is tailored for financial context; second, it incorporates key aspects of social media data, such as non-standard phrases, emojis and emoticons; and third, it operates by sequentially learning a latent representation that includes features such as word order, word usage, and local context. This tool, along with a user guide, is available at: https://github.com/dvamossy/EmTract. Using EmTract, we explore the relationship between investor emotions expressed on social media and asset prices. We document a number of interesting insights. First, we confirm some of the findings of controlled laboratory experiments relating investor emotions to asset price movements. Second, we show that investor emotions are predictive of daily price movements. These impacts are larger when volatility or short interest are higher, and when institutional ownership or liquidity are lower. Third, increased investor enthusiasm prior to the IPO contributes to the large first-day return and long-run underperformance of IPO stocks. To corroborate our results, we provide a number of robustness checks, including using an alternative emotion model. Our findings reinforce the intuition that emotions and market dynamics are closely related, and highlight the importance of considering investor emotions when assessing a stock's short-term value.