Natural Language Processing Academic Digest [7.15]

2021-07-27 10:58:35

Visit www.arxivdaily.com for digests with abstracts covering CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, plus search, bookmarking, and posting features!

cs.CL: 20 papers today

Transformer (1 paper)

【1】 Indonesia's Fake News Detection using Transformer Network

Authors: Aisyah Awalina, Jibran Fawaid, Rifky Yunus Krisnabayu, Novanto Yudistira
Affiliation: Brawijaya University, Indonesia
Link: https://arxiv.org/abs/2107.06796
Abstract: Fake news is a problem faced by society in this era. It is not rare for fake news to cause provocation and problem for the people. Indonesia, as a country with the 4th largest population, has a problem in dealing with fake news. More than 30% of rural and urban population are deceived by this fake news problem. As we have been studying, there is only few literatures on preventing the spread of fake news in Bahasa Indonesia. So, this research is conducted to prevent these problems. The dataset used in this research was obtained from a news portal that identifies fake news, turnbackhoax.id. Using Web Scrapping on this page, we got 1116 data consisting of valid news and fake news. The dataset can be accessed at https://github.com/JibranFawaid/turnbackhoax-dataset. This dataset will be combined with other available datasets. The methods used are CNN, BiLSTM, Hybrid CNN-BiLSTM, and BERT with Transformer Network. This research shows that the BERT method with Transformer Network has the best results with an accuracy of up to 90%.
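For readers who want to reproduce this kind of pipeline, the sketch below shows one way to fine-tune a BERT classifier on scraped fake-news data with the Hugging Face transformers library. It is not the authors' code; the multilingual checkpoint, CSV file name, column names, and hyperparameters are illustrative assumptions.

```python
# Minimal fine-tuning sketch (not the authors' code): binary fake-news
# classification with a multilingual BERT checkpoint from Hugging Face.
# The dataset path, column names, and hyperparameters are assumptions.
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

class NewsDataset(Dataset):
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=256)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

df = pd.read_csv("turnbackhoax.csv")          # assumed columns: text, label (0=valid, 1=fake)
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)

train_ds = NewsDataset(df["text"].tolist(), df["label"].tolist(), tok)
args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```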

BERT (3 papers)

【1】 BERT Fine-Tuning for Sentiment Analysis on Indonesian Mobile Apps Reviews

Authors: Kuncahyo Setyo Nugroho, Anantha Yullian Sukmadewa, Haftittah Wuswilahaken DW, Fitra Abdurrachman Bachtiar, Novanto Yudistira
Affiliation: Brawijaya University
Link: https://arxiv.org/abs/2107.06802
Abstract: User reviews have an essential role in the success of the developed mobile apps. User reviews in the textual form are unstructured data, creating a very high complexity when processed for sentiment analysis. Previous approaches that have been used often ignore the context of reviews. In addition, the relatively small data makes the model overfitting. A new approach, BERT, has been introduced as a transfer learning model with a pre-trained model that has previously been trained to have a better context representation. This study examines the effectiveness of fine-tuning BERT for sentiment analysis using two different pre-trained models. Besides the multilingual pre-trained model, we use the pre-trained model that only has been trained in Indonesian. The dataset used is Indonesian user reviews of the ten best apps in 2020 in Google Play sites. We also perform hyper-parameter tuning to find the optimum trained model. Two training data labeling approaches were also tested to determine the effectiveness of the model, which is score-based and lexicon-based. The experimental results show that pre-trained models trained in Indonesian have better average accuracy on lexicon-based data. The pre-trained Indonesian model highest accuracy is 84%, with 25 epochs and a training time of 24 minutes. These results are better than all of the machine learning and multilingual pre-trained models.
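The paper compares score-based and lexicon-based labeling of the training reviews; the sketch below only illustrates that distinction, using a made-up threshold and a tiny hypothetical Indonesian cue-word list rather than the lexicon or cut-offs used in the study.

```python
# Sketch of the two labeling strategies compared in the paper (score-based vs
# lexicon-based). The threshold and the lexicon contents are illustrative
# assumptions, not taken from the paper.
POS_WORDS = {"bagus", "mantap", "keren", "suka"}      # hypothetical positive cues
NEG_WORDS = {"jelek", "buruk", "lambat", "error"}     # hypothetical negative cues

def label_by_score(star_rating: int) -> int:
    """Score-based: derive the sentiment label from the review's star rating."""
    return 1 if star_rating >= 4 else 0

def label_by_lexicon(review: str) -> int:
    """Lexicon-based: compare counts of positive vs negative cue words."""
    tokens = review.lower().split()
    pos = sum(t in POS_WORDS for t in tokens)
    neg = sum(t in NEG_WORDS for t in tokens)
    return 1 if pos >= neg else 0

print(label_by_score(5))                                          # -> 1
print(label_by_lexicon("aplikasi ini bagus tapi kadang error"))   # -> 1 (tie goes positive)
```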

【2】 Large-Scale News Classification using BERT Language Model: Spark NLP Approach

Authors: Kuncahyo Setyo Nugroho, Anantha Yullian Sukmadewa, Novanto Yudistira
Affiliation: Department of Informatics Engineering, Brawijaya University, Indonesia
Link: https://arxiv.org/abs/2107.06785
Abstract: The rise of big data analytics on top of NLP increases the computational burden for text processing at scale. The problems faced in NLP are very high dimensional text, so it takes a high computation resource. The MapReduce allows parallelization of large computations and can improve the efficiency of text processing. This research aims to study the effect of big data processing on NLP tasks based on a deep learning approach. We classify a big text of news topics with fine-tuning BERT used pre-trained models. Five pre-trained models with a different number of parameters were used in this study. To measure the efficiency of this method, we compared the performance of the BERT with the pipelines from Spark NLP. The result shows that BERT without Spark NLP gives higher accuracy compared to BERT with Spark NLP. The accuracy average and training time of all models using BERT is 0.9187 and 35 minutes while using BERT with Spark NLP pipeline is 0.8444 and 9 minutes. The bigger model will take more computation resources and need a longer time to complete the tasks. However, the accuracy of BERT with Spark NLP only decreased by an average of 5.7%, while the training time was reduced significantly by 62.9% compared to BERT without Spark NLP.
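To make the Spark NLP side concrete, here is a minimal pipeline sketch in the style the paper describes: BERT sentence embeddings feeding a ClassifierDL stage inside a Spark ML pipeline. The pretrained model name, input file, and column names are assumptions for illustration, not taken from the paper.

```python
# Hedged sketch of a Spark NLP classification pipeline (not the authors' code).
# Assumes a CSV with "text" and "category" columns and a small pretrained
# sentence-BERT model from the Spark NLP models hub.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BertSentenceEmbeddings, ClassifierDLApproach
from pyspark.ml import Pipeline

spark = sparknlp.start()
df = spark.read.csv("news.csv", header=True)   # assumed columns: text, category

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128", "en") \
    .setInputCols(["document"]).setOutputCol("sentence_embeddings")
classifier = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]).setOutputCol("class") \
    .setLabelColumn("category").setMaxEpochs(5)

pipeline = Pipeline(stages=[document, embeddings, classifier])
model = pipeline.fit(df)                        # Spark parallelizes the heavy lifting
```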

【3】 Using BERT Encoding to Tackle the Mad-lib Attack in SMS Spam Detection

Authors: Sergio Rojas-Galeano
Affiliation: Universidad Distrital Francisco José de Caldas
Link: https://arxiv.org/abs/2107.06400
Abstract: One of the stratagems used to deceive spam filters is to substitute vocables with synonyms or similar words that turn the message unrecognisable by the detection algorithms. In this paper we investigate whether the recent development of language models sensitive to the semantics and context of words, such as Google's BERT, may be useful to overcome this adversarial attack (called "Mad-lib" as per the word substitution game). Using a dataset of 5572 SMS spam messages, we first established a baseline of detection performance using widely known document representation models (BoW and TFIDF) and the novel BERT model, coupled with a variety of classification algorithms (Decision Tree, kNN, SVM, Logistic Regression, Naive Bayes, Multilayer Perceptron). Then, we built a thesaurus of the vocabulary contained in these messages, and set up a Mad-lib attack experiment in which we modified each message of a held out subset of data (not used in the baseline experiment) with different rates of substitution of original words with synonyms from the thesaurus. Lastly, we evaluated the detection performance of the three representation models (BoW, TFIDF and BERT) coupled with the best classifier from the baseline experiment (SVM). We found that the classic models achieved a 94% Balanced Accuracy (BA) in the original dataset, whereas the BERT model obtained 96%. On the other hand, the Mad-lib attack experiment showed that BERT encodings manage to maintain a similar BA performance of 96% with an average substitution rate of 1.82 words per message, and 95% with 3.34 words substituted per message. In contrast, the BA performance of the BoW and TFIDF encoders dropped to chance. These results hint at the potential advantage of BERT models to combat these type of ingenious attacks, offsetting to some extent for the inappropriate use of semantic relationships in language.
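A Mad-lib attack of the kind studied here can be simulated with a generic thesaurus; the sketch below uses WordNet synonyms and a configurable substitution rate. It approximates, rather than reproduces, the paper's thesaurus built from the SMS vocabulary.

```python
# Illustrative "Mad-lib" substitution attack: replace a fraction of words in an
# SMS with WordNet synonyms. The rate handling is a simplification of the
# paper's thesaurus-based setup.
import random
from nltk.corpus import wordnet   # requires a one-time nltk.download("wordnet")

def madlib_attack(message: str, rate: float = 0.2, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for w in message.split():
        synsets = wordnet.synsets(w)
        lemmas = {l.name().replace("_", " ") for s in synsets for l in s.lemmas()} - {w}
        if lemmas and rng.random() < rate:
            out.append(rng.choice(sorted(lemmas)))   # swap in a synonym
        else:
            out.append(w)                            # keep the original word
    return " ".join(out)

print(madlib_attack("free entry in a weekly competition to win prizes", rate=0.3))
```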

Machine Translation (2 papers)

【1】 Importance-based Neuron Allocation for Multilingual Neural Machine Translation

Authors: Wanying Xie, Yang Feng, Shuhao Gu, Dong Yu
Affiliations: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICTCAS); University of Chinese Academy of Sciences, Beijing, China; Beijing Language and Culture University, China
Notes: ACL 2021
Link: https://arxiv.org/abs/2107.06569
Abstract: Multilingual neural machine translation with a single model has drawn much attention due to its capability to deal with multiple languages. However, the current multilingual translation paradigm often makes the model tend to preserve the general knowledge, but ignore the language-specific knowledge. Some previous works try to solve this problem by adding various kinds of language-specific modules to the model, but they suffer from the parameter explosion problem and require specialized manual design. To solve these problems, we propose to divide the model neurons into general and language-specific parts based on their importance across languages. The general part is responsible for preserving the general knowledge and participating in the translation of all the languages, while the language-specific part is responsible for preserving the language-specific knowledge and participating in the translation of some specific languages. Experimental results on several language pairs, covering IWSLT and Europarl corpus datasets, demonstrate the effectiveness and universality of the proposed method.
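The core idea of splitting neurons into general and language-specific groups can be illustrated with a toy importance score; the sketch below uses mean absolute activation and a top-k overlap, which is a simplification and not the importance criterion defined in the paper.

```python
# Toy sketch of the neuron-allocation idea: score each FFN neuron per language,
# keep the overlap of the most important neurons as "general" and the rest as
# language-specific. The scoring rule and top_k are illustrative assumptions.
import torch

def importance(acts: torch.Tensor) -> torch.Tensor:
    # acts: [num_tokens, num_neurons] activations collected for one language
    return acts.abs().mean(dim=0)

def allocate(acts_per_lang: dict, top_k: int = 512):
    top = {lang: set(importance(a).topk(top_k).indices.tolist())
           for lang, a in acts_per_lang.items()}
    general = set.intersection(*top.values())           # important for every language
    specific = {lang: idx - general for lang, idx in top.items()}
    return general, specific

acts = {"de": torch.randn(1000, 2048), "fr": torch.randn(1000, 2048)}
general, specific = allocate(acts)
print(len(general), {k: len(v) for k, v in specific.items()})
```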

【2】 From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text

Authors: Ishan Tarunesh, Syamantak Kumar, Preethi Jyothi
Affiliations: Samsung Korea; Google India; IIT Bombay
Notes: In Proceedings of The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)
Link: https://arxiv.org/abs/2107.06483
Abstract: Generating code-switched text is a problem of growing interest, especially given the scarcity of corpora containing large volumes of real code-switched text. In this work, we adapt a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences starting from monolingual Hindi sentences. We outline a carefully designed curriculum of pretraining steps, including the use of synthetic code-switched text, that enable the model to generate high-quality code-switched text. Using text generated from our model as data augmentation, we show significant reductions in perplexity on a language modeling task, compared to using text from other generative models of CS text. We also show improvements using our text for a downstream code-switched natural language inference task. Our generated text is further subjected to a rigorous evaluation using a human evaluation study and a range of objective metrics, where we show performance comparable (and sometimes even superior) to code-switched text obtained via crowd workers who are native Hindi speakers.

Graph | Knowledge Graph | Knowledge (2 papers)

【1】 Scalable Optimal Transport in High Dimensions for Graph Distances, Embedding Alignment, and More

Authors: Johannes Klicpera, Marten Lienen, Stephan Günnemann
Affiliation: Technical University of Munich
Notes: Published as a conference paper at ICML 2021
Link: https://arxiv.org/abs/2107.06876
Abstract: The current best practice for computing optimal transport (OT) is via entropy regularization and Sinkhorn iterations. This algorithm runs in quadratic time as it requires the full pairwise cost matrix, which is prohibitively expensive for large sets of objects. In this work we propose two effective log-linear time approximations of the cost matrix: First, a sparse approximation based on locality-sensitive hashing (LSH) and, second, a Nyström approximation with LSH-based sparse corrections, which we call locally corrected Nyström (LCN). These approximations enable general log-linear time algorithms for entropy-regularized OT that perform well even for the complex, high-dimensional spaces common in deep learning. We analyse these approximations theoretically and evaluate them experimentally both directly and end-to-end as a component for real-world applications. Using our approximations for unsupervised word embedding alignment enables us to speed up a state-of-the-art method by a factor of 3 while also improving the accuracy by 3.1 percentage points without any additional model changes. For graph distance regression we propose the graph transport network (GTN), which combines graph neural networks (GNNs) with enhanced Sinkhorn. GTN outcompetes previous models by 48% and still scales log-linearly in the number of nodes.
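For reference, the quadratic-time baseline the paper speeds up is plain entropy-regularized Sinkhorn; a minimal dense implementation is sketched below. The paper's contribution lies in approximating the Gibbs kernel K with LSH and locally corrected Nyström instead of materializing it, which is not shown here.

```python
# Dense entropy-regularized OT via Sinkhorn iterations: the O(n*m) reference
# the paper improves on. Epsilon and iteration count are illustrative.
import numpy as np

def sinkhorn(a, b, cost, eps=0.05, n_iters=200):
    """a, b: source/target marginals; cost: [n, m] pairwise cost matrix."""
    K = np.exp(-cost / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]      # transport plan

n, m = 50, 60
x, y = np.random.randn(n, 8), np.random.randn(m, 8)
cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
plan = sinkhorn(np.full(n, 1 / n), np.full(m, 1 / m), cost)
print(plan.sum())                            # ~1.0: a valid coupling
```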

【2】 MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation

Authors: Jingwen Hu, Yuchen Liu, Jinming Zhao, Qin Jin
Affiliation: School of Information, Renmin University of China
Link: https://arxiv.org/abs/2107.06779
Abstract: Emotion recognition in conversation (ERC) is a crucial component in affective dialogue systems, which helps the system understand users' emotions and generate empathetic responses. However, most works focus on modeling speaker and contextual information primarily on the textual modality or simply leveraging multimodal information through feature concatenation. In order to explore a more effective way of utilizing both multimodal and long-distance contextual information, we propose a new model based on multimodal fused graph convolutional network, MMGCN, in this work. MMGCN can not only make use of multimodal dependencies effectively, but also leverage speaker information to model inter-speaker and intra-speaker dependency. We evaluate our proposed model on two public benchmark datasets, IEMOCAP and MELD, and the results prove the effectiveness of MMGCN, which outperforms other SOTA methods by a significant margin under the multimodal conversation setting.

Corpora (1 paper)

【1】 ParCourE: A Parallel Corpus Explorer for a Massively Multilingual Corpus

Authors: Ayyoob Imani, Masoud Jalili Sabet, Philipp Dufter, Michael Cysouw, Hinrich Schütze
Affiliations: Center for Information and Language Processing (CIS), LMU Munich, Germany; Research Center Deutscher Sprachatlas, Philipps University Marburg, Germany
Notes: The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing
Link: https://arxiv.org/abs/2107.06632
Abstract: With more than 7000 languages worldwide, multilingual natural language processing (NLP) is essential both from an academic and commercial perspective. Researching typological properties of languages is fundamental for progress in multilingual NLP. Examples include assessing language similarity for effective transfer learning, injecting inductive biases into machine learning models or creating resources such as dictionaries and inflection tables. We provide ParCourE, an online tool that allows to browse a word-aligned parallel corpus, covering 1334 languages. We give evidence that this is useful for typological research. ParCourE can be set up for any parallel corpus and can thus be used for typological research on other corpora as well as for exploring their quality and properties.

Other Neural Networks | Deep Learning | Models | Modeling (3 papers)

【1】 ZR-2021VG: Zero-Resource Speech Challenge, Visually-Grounded Language Modelling track, 2021 edition

Authors: Afra Alishahi, Grzegorz Chrupała, Alejandrina Cristia, Emmanuel Dupoux, Bertrand Higy, Marvin Lavechin, Okko Räsänen, Chen Yu
Affiliations: Dept. of Cognitive Science and AI, Tilburg University, Netherlands; Laboratoire de Sciences Cognitives et Psycholinguistique, ENS, Paris, France; Unit of Computing Sciences, Tampere University, Finland
Link: https://arxiv.org/abs/2107.06546
Abstract: We present the visually-grounded language modelling track that was introduced in the Zero-Resource Speech challenge, 2021 edition, 2nd round. We motivate the new track and discuss participation rules in detail. We also present the two baseline systems that were developed for this track.

【2】 Learning Algebraic Recombination for Compositional Generalization

Authors: Chenyao Liu, Shengnan An, Zeqi Lin, Qian Liu, Bei Chen, Jian-Guang Lou, Lijie Wen, Nanning Zheng, Dongmei Zhang
Affiliations: School of Software, Tsinghua University; Xi'an Jiaotong University; Microsoft Research Asia; Beihang University
Notes: ACL Findings 2021
Link: https://arxiv.org/abs/2107.06516
Abstract: Neural sequence models exhibit limited compositional generalization ability in semantic parsing tasks. Compositional generalization requires algebraic recombination, i.e., dynamically recombining structured expressions in a recursive manner. However, most previous studies mainly concentrate on recombining lexical units, which is an important but not sufficient part of algebraic recombination. In this paper, we propose LeAR, an end-to-end neural model to learn algebraic recombination for compositional generalization. The key insight is to model the semantic parsing task as a homomorphism between a latent syntactic algebra and a semantic algebra, thus encouraging algebraic recombination. Specifically, we learn two modules jointly: a Composer for producing latent syntax, and an Interpreter for assigning semantic operations. Experiments on two realistic and comprehensive compositional generalization benchmarks demonstrate the effectiveness of our model. The source code is publicly available at https://github.com/microsoft/ContextualSP.

【3】 Deduplicating Training Data Makes Language Models Better

Authors: Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini
Affiliation: University of Pennsylvania
Link: https://arxiv.org/abs/2107.06499
Abstract: We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets -- for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.
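A toy version of near-duplicate filtering conveys the idea; the sketch below scores documents by character n-gram Jaccard similarity with an arbitrary threshold. The released tools work very differently at scale (suffix arrays for exact substrings and MinHash-style approximate matching), so treat this only as an illustration of the goal.

```python
# Toy near-duplicate filter: drop a document if its character n-gram Jaccard
# similarity to any kept document exceeds a threshold. n and the threshold are
# illustrative; the pairwise loop is O(n^2) and only suitable for small sets.
def ngrams(text: str, n: int = 8) -> set:
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / max(len(a | b), 1)

def deduplicate(docs, threshold: float = 0.8):
    kept, signatures = [], []
    for doc in docs:
        sig = ngrams(doc)
        if all(jaccard(sig, s) < threshold for s in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept

docs = ["the cat sat on the mat", "the cat sat on the mat!", "a completely different sentence"]
print(deduplicate(docs))    # the second doc is dropped as a near-duplicate
```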

Other (8 papers)

【1】 "How to best say it?" : Translating Directives in Machine Language into Natural Language in the Blocks World

Authors: Sujeong Kim, Amir Tamrakar
Affiliation: SRI International, Princeton, NJ, USA
Link: https://arxiv.org/abs/2107.06886
Abstract: We propose a method to generate optimal natural language for block placement directives generated by a machine's planner during human-agent interactions in the blocks world. A non user-friendly machine directive, e.g., move(ObjId, toPos), is transformed into visually and contextually grounded referring expressions that are much easier for the user to comprehend. We describe an algorithm that progressively and generatively transforms the machine's directive in ECI (Elementary Composable Ideas)-space, generating many alternative versions of the directive. We then define a cost function to evaluate the ease of comprehension of these alternatives and select the best option. The parameters for this cost function were derived empirically from a user study that measured utterance-to-action timings.

【2】 Composing Conversational Negation

Authors: Razin A. Shaikh, Lia Yeh, Benjamin Rodatz, Bob Coecke
Affiliations: Mathematical Institute, University of Oxford; Quantum Group, Computer Science; Cambridge Quantum
Notes: 14 pages, many figures, In Proceedings ACT 2020
Link: https://arxiv.org/abs/2107.06820
Abstract: Negation in natural language does not follow Boolean logic and is therefore inherently difficult to model. In particular, it takes into account the broader understanding of what is being negated. In previous work, we proposed a framework for negation of words that accounts for `worldly context'. In this paper, we extend that proposal now accounting for the compositional structure inherent in language, within the DisCoCirc framework. We compose the negations of single words to capture the negation of sentences. We also describe how to model the negation of words whose meanings evolve in the text.

【3】 Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals

Authors: Guillaume Cabanac, Cyril Labbé, Alexander Magazinov
Link: https://arxiv.org/abs/2107.06751
Abstract: Probabilistic text generators have been used to produce fake scientific papers for more than a decade. Such nonsensical papers are easily detected by both human and machine. Now more complex AI-powered generation techniques produce texts indistinguishable from that of humans and the generation of scientific texts from a few keywords has been documented. Our study introduces the concept of tortured phrases: unexpected weird phrases in lieu of established ones, such as 'counterfeit consciousness' instead of 'artificial intelligence.' We combed the literature for tortured phrases and study one reputable journal where these concentrated en masse. Hypothesising the use of advanced language models we ran a detector on the abstracts of recent articles of this journal and on several control sets. The pairwise comparisons reveal a concentration of abstracts flagged as 'synthetic' in the journal. We also highlight irregularities in its operation, such as abrupt changes in editorial timelines. We substantiate our call for investigation by analysing several individual dubious articles, stressing questionable features: tortured writing style, citation of non-existent literature, and unacknowledged image reuse. Surprisingly, some websites offer to rewrite texts for free, generating gobbledegook full of tortured phrases. We believe some authors used rewritten texts to pad their manuscripts. We wish to raise the awareness on publications containing such questionable AI-generated or rewritten texts that passed (poor) peer review. Deception with synthetic texts threatens the integrity of the scientific literature.

【4】 Linking Health News to Research Literature

Authors: Jun Wang, Bei Yu
Affiliations: Independent Researcher, Jamesville, NY, USA; School of Information Studies, Syracuse University
Notes: 13 pages, 3 figures
Link: https://arxiv.org/abs/2107.06472
Abstract: Accurately linking news articles to scientific research works is a critical component in a number of applications, such as measuring the social impact of a research work and detecting inaccuracies or distortions in science news. Although the lack of links between news and literature has been a challenge in these applications, it is a relatively unexplored research problem. In this paper we designed and evaluated a new approach that consists of (1) augmenting latest named-entity recognition techniques to extract various metadata, and (2) designing a new elastic search engine that can facilitate the use of enriched metadata queries. To evaluate our approach, we constructed two datasets of paired news articles and research papers: one is used for training models to extract metadata, and the other for evaluation. Our experiments showed that the new approach performed significantly better than a baseline approach used by altmetric.com (0.89 vs 0.32 in terms of top-1 accuracy). To further demonstrate the effectiveness of the approach, we also conducted a study on 37,600 health-related press releases published on EurekAlert!, which showed that our approach was able to identify the corresponding research papers with a top-1 accuracy of at least 0.97.
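On the retrieval side, an enriched-metadata query against Elasticsearch might look like the sketch below. The index name, field names, and example values are hypothetical and only illustrate the kind of query such an engine facilitates; this is not the paper's schema or code.

```python
# Hedged sketch of metadata-enriched retrieval with the Python Elasticsearch
# client (a recent client version is assumed). Index, fields, and values are
# hypothetical placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query = {
    "bool": {
        "must": [{"match": {"title": "gut microbiome depression"}}],
        "should": [
            {"match": {"journal": "Nature Communications"}},   # metadata from the news article
            {"match": {"authors": "Smith"}},
        ],
    }
}
hits = es.search(index="papers", query=query, size=5)
for hit in hits["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```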

【5】 TSCAN: Dialog Structure discovery using SCAN

Authors: Apurba Nath, Aayush Kubba
Link: https://arxiv.org/abs/2107.06426
Abstract: Can we discover dialog structure by dividing utterances into labelled clusters. Can these labels be generated from the data. Typically for dialogs we need an ontology and use that to discover structure, however by using unsupervised classification and self-labelling we are able to intuit this structure without any labels or ontology. In this paper we apply SCAN (Semantic Clustering using Nearest Neighbors) to dialog data. We used BERT for pretext task and an adaptation of SCAN for clustering and self labeling. These clusters are used to identify transition probabilities and create the dialog structure. The self-labelling method used for SCAN makes these structures interpretable as every cluster has a label. As the approach is unsupervised, evaluation metrics is a challenge, we use statistical measures as proxies for structure quality
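Once every utterance carries a cluster label, the structure-induction step reduces to counting transitions between consecutive turns; here is a small sketch under made-up labels (the dialogs and label names are placeholders, not data from the paper).

```python
# Estimate transition probabilities between cluster labels across consecutive
# dialog turns. The example dialogs and labels are illustrative.
from collections import Counter, defaultdict

def transition_probs(dialogs):
    counts = defaultdict(Counter)
    for labels in dialogs:                       # one list of cluster labels per dialog
        for prev, nxt in zip(labels, labels[1:]):
            counts[prev][nxt] += 1
    return {prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for prev, nxts in counts.items()}

dialogs = [["greet", "request", "inform", "thank"],
           ["greet", "inform", "thank"]]
print(transition_probs(dialogs))   # e.g. greet -> request 0.5, greet -> inform 0.5
```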

【6】 How Much Can CLIP Benefit Vision-and-Language Tasks?

Authors: Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer
Affiliations: University of California, Berkeley; University of California, Los Angeles; University of North Carolina at Chapel Hill
Notes: 14 pages
Link: https://arxiv.org/abs/2107.06383
Abstract: Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world. However, it has been observed that large-scale pretraining usually can result in better generalization performance, e.g., CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown a strong zero-shot capability on various vision tasks. To further study the advantage brought by CLIP, we propose to use CLIP as the visual encoder in various V&L models in two typical scenarios: 1) plugging CLIP into task-specific fine-tuning; 2) combining CLIP with V&L pre-training and transferring to downstream tasks. We show that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown. We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks. We release our code at https://github.com/clip-vil/CLIP-ViL.
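Plugging CLIP in as the visual encoder amounts to swapping the image-feature extractor; a minimal sketch with the public Hugging Face CLIP checkpoint is below. The image file is a placeholder, and the downstream V&L head that would consume these features is not shown, so this only illustrates the feature-extraction step rather than the paper's full models.

```python
# Extract pooled image features from a pretrained CLIP model; these features
# would replace region features from encoders such as BottomUp-TopDown.
# "example.jpg" is a placeholder path.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    visual_feats = model.get_image_features(**inputs)   # shape [1, 512]

print(visual_feats.shape)
```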

【7】 What do writing features tell us about AI papers?

Authors: Zining Zhu, Bai Li, Yang Xu, Frank Rudzicz
Affiliations: Department of Computer Science, University of Toronto; Vector Institute for Artificial Intelligence; Li Ka Shing Knowledge Institute, Unity Health Toronto; Surgical Safety Technologies
Notes: 15 pages, 4 figures
Link: https://arxiv.org/abs/2107.06310
Abstract: As the numbers of submissions to conferences grow quickly, the task of assessing the quality of academic papers automatically, convincingly, and with high accuracy attracts increasing attention. We argue that studying interpretable dimensions of these submissions could lead to scalable solutions. We extract a collection of writing features, and construct a suite of prediction tasks to assess the usefulness of these features in predicting citation counts and the publication of AI-related papers. Depending on the venues, the writing features can predict the conference vs. workshop appearance with F1 scores up to 60-90, sometimes even outperforming the content-based tf-idf features and RoBERTa. We show that the features describe writing style more than content. To further understand the results, we estimate the causal impact of the most indicative features. Our analysis on writing features provides a perspective to assessing and refining the writing of academic articles at scale.

【8】 How to make qubits speak

Authors: Bob Coecke, Giovanni de Felice, Konstantinos Meichanetzidis, Alexis Toumi
Affiliations: Cambridge Quantum Computing Ltd.; Oxford University, Department of Computer Science
Notes: Invited contribution to "Quantum Computing in the Arts and Humanities"
Link: https://arxiv.org/abs/2107.06776
Abstract: This is a story about making quantum computers speak, and doing so in a quantum-native, compositional and meaning-aware manner. Recently we did question-answering with an actual quantum computer. We explain what we did, stress that this was all done in terms of pictures, and provide many pointers to the related literature. In fact, besides natural language, many other things can be implemented in a quantum-native, compositional and meaning-aware manner, and we provide the reader with some indications of that broader pictorial landscape, including our account on the notion of compositionality. We also provide some guidance for the actual execution, so that the reader can give it a go as well.
