cs.CL: 19 papers today
Transformer (1 paper)
【1】 Comparison of Czech Transformers on Text Classification Tasks
Authors: Jan Lehečka, Jan Švec Note: this https URL Link: https://arxiv.org/abs/2107.10042 Abstract: In this paper, we present our progress in pre-training monolingual Transformers for Czech and contribute to the research community by releasing our models to the public. The need for such models emerged from our effort to employ Transformers in our language-specific tasks, but we found the performance of the published multilingual models to be very limited. Since the multilingual models are usually pre-trained on more than 100 languages, most low-resourced languages (including Czech) are under-represented in these models. At the same time, there is a huge amount of monolingual training data available in web archives like Common Crawl. We have pre-trained and publicly released two monolingual Czech Transformers and compared them with relevant public models trained (at least partially) for Czech. The paper presents the Transformers' pre-training procedure as well as a comparison of the pre-trained models on text classification tasks from various domains.
Graph|Knowledge Graph|Knowledge (1 paper)
【1】 CausalBERT: Injecting Causal Knowledge Into Pre-trained Models with Minimal Supervision
Authors: Zhongyang Li, Xiao Ding, Kuo Liao, Ting Liu, Bing Qin Affiliation: Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China Link: https://arxiv.org/abs/2107.09852 Abstract: Recent work has shown success in incorporating pre-trained models like BERT to improve NLP systems. However, existing pre-trained models lack causal knowledge, which prevents today's NLP systems from thinking like humans. In this paper, we investigate the problem of injecting causal knowledge into pre-trained models. There are two fundamental problems: 1) how to collect a large-scale causal resource from unstructured texts; 2) how to effectively inject causal knowledge into pre-trained models. To address these issues, we propose CausalBERT, which collects the largest-scale causal resource using precise causal patterns and causal embedding techniques. In addition, we adopt a regularization-based method that preserves the already learned knowledge with an extra regularization term while injecting causal knowledge. Extensive experiments on 7 datasets, including four causal pair classification tasks, two causal QA tasks, and a causal inference task, demonstrate that CausalBERT captures rich causal knowledge and outperforms all state-of-the-art methods based on pre-trained models, establishing a new causal inference benchmark.
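The abstract names an "extra regularization term" that keeps previously learned knowledge intact while causal knowledge is injected. The paper does not publish its exact formulation here; a minimal sketch of one common instantiation (an L2 penalty anchoring fine-tuned weights to the pre-injection weights, L2-SP style) could look like this. All names are illustrative.

```python
import torch

def regularized_loss(task_loss, model, anchor_state, lam=0.01):
    """task loss + lam * ||theta - theta_anchor||^2, an assumed form of the
    knowledge-preserving regularizer; `anchor_state` is a snapshot of the
    parameters taken before causal-knowledge injection."""
    reg = 0.0
    for name, param in model.named_parameters():
        if name in anchor_state:
            reg = reg + ((param - anchor_state[name]) ** 2).sum()
    return task_loss + lam * reg

# Usage: snapshot once before injection, then add `reg` to every step's loss.
# anchor_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
```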
Summarization|Information Extraction (1 paper)
【1】 An artificial intelligence natural language processing pipeline for information extraction in neuroradiology
Authors: Henry Watkins, Robert Gray, Ashwani Jha, Parashkev Nachev Affiliation: UCL Queen Square Institute of Neurology, University College London Note: 20 pages, 2 PNG figures Link: https://arxiv.org/abs/2107.10021 Abstract: The use of electronic health records in medical research is difficult because of their unstructured format. Extracting information from reports and summarising patient presentations in a way amenable to downstream analysis would be enormously beneficial for operational and clinical research. In this work we present a natural language processing pipeline for information extraction from radiological reports in neurology. Our pipeline uses a hybrid sequence of rule-based and artificial intelligence models to accurately extract and summarise neurological reports. We train and evaluate a custom language model on a corpus of 150,000 radiological (MRI) reports from the National Hospital for Neurology and Neurosurgery, London. We also present results for standard NLP tasks on domain-specific neuroradiology datasets. We show that our pipeline, called `neuroNLP', can reliably extract clinically relevant information from these reports, enabling downstream modelling of reports and associated imaging on a heretofore unprecedented scale.
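A minimal sketch of the kind of hybrid rule-based plus learned-model sequencing the abstract describes: high-precision rules fire first, and a learned classifier handles what they miss. The rule, labels, and interface are invented for illustration and are not neuroNLP's.

```python
import re

# Hypothetical high-precision rule for negated findings in a report.
NEGATION_RULE = re.compile(r"\bno evidence of (\w+)", re.I)

def extract(report_text, model_predict):
    """Rule-based stage first; fall back to the learned model otherwise.
    `model_predict` stands in for any trained extraction model."""
    m = NEGATION_RULE.search(report_text)
    if m:
        return {"finding": m.group(1), "asserted": False}
    return model_predict(report_text)
```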
Reasoning|Analysis|Understanding|Interpretation (2 papers)
【1】 The Effectiveness of Intermediate-Task Training for Code-Switched Natural Language Understanding
Authors: Archiki Prasad, Mohammad Ali Rehan, Shreya Pathak, Preethi Jyothi Affiliation: Indian Institute of Technology Bombay Link: https://arxiv.org/abs/2107.09931 Abstract: While recent benchmarks have spurred a lot of new work on improving the generalization of pretrained multilingual language models on multilingual tasks, techniques to improve code-switched natural language understanding tasks have been far less explored. In this work, we propose the use of bilingual intermediate pretraining as a reliable technique to derive large and consistent performance gains on three different NLP tasks using code-switched text. We achieve substantial absolute improvements of 7.87%, 20.15%, and 10.99% on the mean accuracies and F1 scores over previous state-of-the-art systems for Hindi-English Natural Language Inference (NLI), Question Answering (QA), and Spanish-English Sentiment Analysis (SA), respectively. We show consistent performance gains on four different code-switched language pairs (Hindi-English, Spanish-English, Tamil-English, and Malayalam-English) for SA. We also present a code-switched masked language modelling (MLM) pretraining technique that consistently benefits SA compared to standard MLM pretraining using real code-switched text.
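The paper's contribution lies in what corpus the MLM pretraining sees (code-switched text); the masking step itself is the standard BERT recipe, sketched below for concreteness: 15% of tokens are selected, of which 80% become the mask token, 10% a random token, and 10% stay unchanged. Parameter names are illustrative, not the authors'.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Standard BERT-style MLM masking on (already code-switched) token ids."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100                                   # loss only on masked positions
    replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replace] = mask_token_id                       # 80% -> [MASK]
    rand = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replace
    input_ids[rand] = torch.randint(vocab_size, input_ids.shape)[rand]  # 10% -> random token
    return input_ids, labels                                 # remaining 10% unchanged
```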
【2】 TLA: Twitter Linguistic Analysis
Authors: Tushar Sarkar, Nishant Rajadhyaksha Affiliation: KJ Somaiya College of Engineering, Mumbai Link: https://arxiv.org/abs/2107.09710 Abstract: Linguistics has been instrumental in developing a deeper understanding of human nature. Words are indispensable to bequeath the thoughts, emotions, and purpose of any human interaction, and critically analyzing these words can elucidate the social and psychological behavior and characteristics of these social animals. Social media has become a platform for human interaction on a large scale and thus gives us scope for collecting and using that data for our study. However, the process of collecting, labeling, and analyzing this data iteratively makes the entire procedure cumbersome. To make this process easier and more structured, we would like to introduce TLA (Twitter Linguistic Analysis). In this paper, we describe TLA, provide a basic understanding of the framework, and discuss the process of collecting, labeling, and analyzing Twitter data for a corpus of languages, while providing detailed labeled datasets for all the languages and models trained on these datasets. The analysis provided by TLA will also go a long way in understanding the sentiments of different linguistic communities and coming up with new and innovative solutions for their problems based on the analysis.
GAN|Adversarial|Attack|Generation (3 papers)
【1】 Using Adversarial Debiasing to Remove Bias from Word Embeddings
Author: Dana Kenna Note: 7 pages, 9 figures; adapted from my thesis project Link: https://arxiv.org/abs/2107.10251 Abstract: Word embeddings have been shown to contain the societal biases present in the original corpora. Existing methods to deal with this problem have been shown to only remove superficial biases. The method of Adversarial Debiasing was presumed to be similarly superficial, but this was not verified in previous works. Using the experiments that demonstrated the shallow removal in other methods, I show results that suggest Adversarial Debiasing is more effective at removing bias, and thus motivate further investigation of the utility of Adversarial Debiasing.
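For readers unfamiliar with the technique under study, here is a minimal sketch of the general adversarial-debiasing idea (in the style of Zhang et al., 2018): an adversary tries to predict a protected attribute from an embedding, and a gradient-reversal layer trains the embedding to defeat it. This illustrates the technique, not the author's exact setup; dimensions and the adversary architecture are assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity forward; negated gradient backward, so the embedding is
    pushed to make the adversary's job harder."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad

adversary = nn.Linear(300, 1)  # predicts a protected attribute from a 300-d vector

def adversarial_debias_loss(embedding, protected_label):
    # protected_label: float tensor of shape (batch, 1)
    logits = adversary(GradReverse.apply(embedding))
    return nn.functional.binary_cross_entropy_with_logits(logits, protected_label)
```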
【2】 Improved Text Classification via Contrastive Adversarial Training
Authors: Lin Pan, Chung-Wei Hang, Avirup Sil, Saloni Potdar, Mo Yu Affiliations: IBM Watson, IBM Research AI, MIT-IBM Watson AI Lab Link: https://arxiv.org/abs/2107.10137 Abstract: We propose a simple and general method to regularize the fine-tuning of Transformer-based encoders for text classification tasks. Specifically, during fine-tuning we generate adversarial examples by perturbing the word embeddings of the model and perform contrastive learning on clean and adversarial examples in order to teach the model to learn noise-invariant representations. By training on both clean and adversarial examples along with the additional contrastive objective, we observe consistent improvement over standard fine-tuning on clean examples. On several GLUE benchmark tasks, our fine-tuned BERT-Large model outperforms the BERT-Large baseline by 1.7% on average, and our fine-tuned RoBERTa-Large improves over the RoBERTa-Large baseline by 1.3%. We additionally validate our method in different domains using three intent classification datasets, where our fine-tuned RoBERTa-Large outperforms the RoBERTa-Large baseline by 1-2% on average.
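A minimal sketch of the two ingredients the abstract names: an adversarial example built by perturbing word embeddings along the loss gradient (FGSM-style), and an InfoNCE-style contrastive loss pulling clean and adversarial sentence representations together. Epsilon, the temperature, and the absence of projection heads are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def adversarial_embeddings(word_emb, loss, epsilon=1e-3):
    """Perturb the embedding tensor in the direction that increases the loss."""
    grad, = torch.autograd.grad(loss, word_emb, retain_graph=True)
    return word_emb + epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)

def contrastive_loss(clean_repr, adv_repr, temperature=0.1):
    """InfoNCE over a batch: each clean example's positive is its own
    adversarial counterpart; other batch items are negatives."""
    z1 = F.normalize(clean_repr, dim=-1)
    z2 = F.normalize(adv_repr, dim=-1)
    logits = z1 @ z2.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```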
【3】 Guided Generation of Cause and Effect
Authors: Zhongyang Li, Xiao Ding, Ting Liu, J. Edward Hu, Benjamin Van Durme Affiliations: Harbin Institute of Technology, China; Johns Hopkins University, USA Note: accepted in the IJCAI 2020 main track Link: https://arxiv.org/abs/2107.09846 Abstract: We present a conditional text generation framework that posits sentential expressions of possible causes and effects. This framework depends on two novel resources we develop in the course of this work: a very large-scale collection of English sentences expressing causal patterns, CausalBank; and a refinement over previous work on constructing large lexical causal knowledge graphs, the Cause Effect Graph. Further, we extend prior work in lexically-constrained decoding to support disjunctive positive constraints. Human assessment confirms that our approach gives high-quality and diverse outputs. Finally, we use CausalBank to perform continued training of an encoder supporting a recent state-of-the-art model for causal reasoning, leading to a 3-point improvement on the COPA challenge set, with no change in model architecture.
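To make "disjunctive positive constraints" concrete: each constraint group holds alternative surface forms, and a hypothesis satisfies the group if it contains any one of them. Real constrained beam search tracks this state incrementally per hypothesis; the sketch below only shows the satisfaction check, under assumed list-of-token-lists inputs.

```python
def satisfies(hypothesis_tokens, disjunctive_constraints):
    """`disjunctive_constraints`: list of groups; each group is a list of
    alternative token sequences, any one of which satisfies the group.
    Returns True iff every group is satisfied by the hypothesis."""
    def contains(seq, sub):
        return any(seq[i:i + len(sub)] == sub
                   for i in range(len(seq) - len(sub) + 1))
    return all(any(contains(hypothesis_tokens, alt) for alt in group)
               for group in disjunctive_constraints)

# e.g. require SOME form of "because of" OR "due to" to appear:
# satisfies(tokens, [[["because", "of"], ["due", "to"]]])
```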
Semi/Weakly/Unsupervised|Uncertainty (1 paper)
【1】 VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion
Authors: Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng Affiliations: The Chinese University of Hong Kong, Hong Kong SAR, China; Huawei Noah's Ark Lab Note: Accepted to Interspeech 2021. Code, pre-trained models and demo are available at this https URL Link: https://arxiv.org/abs/2106.10132 Abstract: One-shot voice conversion (VC), which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference, can be effectively achieved by speech representation disentanglement. Existing work generally ignores the correlation between different speech representations during training, which causes leakage of content information into the speaker representation and thus degrades VC performance. To alleviate this issue, we employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training, to achieve proper disentanglement of content, speaker and pitch representations by reducing their inter-dependencies in an unsupervised manner. Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations for retaining source linguistic content and intonation variations, while capturing target speaker characteristics. In doing so, the proposed approach achieves higher speech naturalness and speaker similarity than current state-of-the-art one-shot VC systems. Our code, pre-trained models and demo are available at https://github.com/Wendison/VQMIVC.
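A minimal sketch of the vector-quantization step used for content encoding: each frame embedding is snapped to its nearest codebook vector, with a straight-through estimator so gradients still reach the encoder. The MI-based disentanglement term is omitted, and the codebook size and dimension are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                     # z: (batch, time, dim)
        # Squared distance from each frame to every codebook entry.
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        codes = dist.argmin(dim=-1)                           # nearest code per frame
        q = self.codebook(codes)
        commit_loss = ((q.detach() - z) ** 2).mean()          # pull encoder toward codes
        q = z + (q - z).detach()                              # straight-through gradient
        return q, codes, commit_loss
```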
Detection (1 paper)
【1】 Checkovid: A COVID-19 misinformation detection system on Twitter using network and content mining perspectives
Authors: Sajad Dadgar, Mehdi Ghatee Affiliation: Department of Mathematics and Computer Science, Amirkabir University of Technology, Hafez Avenue, Tehran, Iran Note: 20 pages, 18 figures, 7 tables; submitted for review at a journal Link: https://arxiv.org/abs/2107.09768 Abstract: During the COVID-19 pandemic, social media platforms were ideal for communicating due to social isolation and quarantine. They were also the primary source of misinformation dissemination on a large scale, referred to as the infodemic. Therefore, automatically debunking misinformation is a crucial problem. To tackle this problem, we present two COVID-19-related misinformation datasets on Twitter and propose a misinformation detection system comprising network-based and content-based processes built on machine learning algorithms and NLP techniques. In the network-based process, we focus on social properties, network characteristics, and users. In the content-based process, by contrast, we classify misinformation using the content of the tweets directly, with text classification models (paragraph-level and sentence-level) and similarity models. The evaluation of the network-based process shows the best results for the artificial neural network model, with an F1 score of 88.68%. In the content-based process, our novel similarity models, which obtained an F1 score of 90.26%, improve on the misinformation classification results of the network-based models. In addition, among the text classification models, the best result was achieved by the stacking ensemble-learning model, with an F1 score of 95.18%. Furthermore, we test our content-based models on the Constraint@AAAI2021 dataset, and by obtaining an F1 score of 94.38%, we improve on the baseline results. Finally, we develop a fact-checking website called Checkovid that uses each process to detect misinformative and informative claims in the domain of COVID-19 from different perspectives.
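A minimal sketch of a stacking ensemble text classifier of the kind the abstract credits with its best F1: base learners' outputs feed a meta-learner. The choice of TF-IDF features and these particular base learners is an assumption for illustration, not the paper's exact configuration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

stack = make_pipeline(
    TfidfVectorizer(max_features=20000, ngram_range=(1, 2)),
    StackingClassifier(
        estimators=[("svm", LinearSVC()), ("rf", RandomForestClassifier())],
        final_estimator=LogisticRegression(),  # meta-learner on base-model outputs
    ),
)
# stack.fit(train_texts, train_labels)
# stack.predict(["5G towers spread the virus"])  # -> misinformation / informative
```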
Zero/Few/One-Shot|Transfer|Adaptation (1 paper)
【1】 Soft Layer Selection with Meta-Learning for Zero-Shot Cross-Lingual Transfer
Authors: Weijia Xu, Batool Haider, Jason Krone, Saab Mansour Affiliations: Department of Computer Science, University of Maryland; Amazon AI Note: MetaNLP at ACL 2021 Link: https://arxiv.org/abs/2107.09840 Abstract: Multilingual pre-trained contextual embedding models (Devlin et al., 2019) have achieved impressive performance on zero-shot cross-lingual transfer tasks. Finding the most effective strategy to fine-tune these models on high-resource languages so that they transfer well to the zero-shot languages is a non-trivial task. In this paper, we propose a novel meta-optimizer to soft-select which layers of the pre-trained model to freeze during fine-tuning. We train the meta-optimizer by simulating the zero-shot transfer scenario. Results on cross-lingual natural language inference show that our approach improves over the simple fine-tuning baseline and X-MAML (Nooralahzadeh et al., 2020).
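A minimal sketch of what "soft-selecting which layers to freeze" can mean mechanically: a learnable gate g_l in (0,1) per layer scales that layer's update, so g_l near 0 approximates freezing. In the paper the gates are learned by a meta-optimizer that simulates zero-shot transfer; this sketch only shows the gating, with assumed names and values.

```python
import torch

NUM_LAYERS = 12
gate_logits = torch.zeros(NUM_LAYERS, requires_grad=True)  # meta-learned in the paper

def gated_update(layer_params, layer_idx, lr=2e-5):
    """Scale a layer's gradient step by its soft-freeze gate."""
    g = torch.sigmoid(gate_logits[layer_idx])  # 0 = frozen, 1 = fully fine-tuned
    with torch.no_grad():
        for p in layer_params:
            if p.grad is not None:
                p -= lr * g * p.grad
```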
Word2Vec|Text|Word (2 papers)
【1】 Debiasing Multilingual Word Embeddings: A Case Study of Three Indian Languages
Authors: Srijan Bansal, Vishal Garimella, Ayush Suhane, Animesh Mukherjee Affiliations: Dept. of ECE, Dept. of CSE, and Dept. of Mathematics, IIT Kharagpur, West Bengal, India Note: Accepted as a long paper in the proceedings of ACM HyperText 2021 Link: https://arxiv.org/abs/2107.10181 Abstract: In this paper, we advance the current state-of-the-art method for debiasing monolingual word embeddings so as to generalize well in a multilingual setting. We consider different methods to quantify bias and different debiasing approaches for monolingual as well as multilingual settings. We demonstrate the significance of our bias-mitigation approach on downstream NLP applications. Our proposed methods establish the state-of-the-art performance for debiasing multilingual embeddings for three Indian languages - Hindi, Bengali, and Telugu - in addition to English. We believe that our work will open up new opportunities in building unbiased downstream NLP applications that are inherently dependent on the quality of the word embeddings used.
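For orientation, a minimal sketch of the projection primitive most embedding-debiasing methods build on (hard debiasing à la Bolukbasi et al.): remove each word vector's component along an estimated bias direction. This is background for the technique family, not necessarily the paper's exact multilingual method; the seed words are illustrative.

```python
import numpy as np

def debias(vec, bias_direction):
    """Project `vec` off the (unit-normalized) bias direction."""
    b = bias_direction / np.linalg.norm(bias_direction)
    return vec - np.dot(vec, b) * b

# The bias direction is often estimated from seed pairs, e.g. for gender:
# bias_direction = emb["he"] - emb["she"]   (or language-specific seed pairs)
```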
【2】 A Statistical Model of Word Rank Evolution
Authors: Alex John Quijano, Rick Dale, Suzanne Sindi Affiliations: Applied Mathematics, University of California Merced, Merced, California, USA; Department of Communications, University of California Los Angeles, Los Angeles Note: This manuscript - with 53 pages and 28 figures - is a draft for a journal research article submission Link: https://arxiv.org/abs/2107.09948 Abstract: The availability of large linguistic data sets enables data-driven approaches to study linguistic change. This work explores the word rank dynamics of eight languages by investigating the Google Books corpus unigram frequency data set. We observed the rank changes of the unigrams from 1900 to 2008 and compared them to a Wright-Fisher inspired model that we developed for our analysis. The model simulates a neutral evolutionary process with the restriction of having no disappearing words. This work explains the mathematical framework of the model - written as a Markov chain with multinomial transition probabilities - to show how frequencies of words change in time. From our observations in the data and our model, word rank stability shows two types of characteristics: (1) the increase/decrease in ranks is monotonic, or (2) the average rank stays the same. Based on our model, high-ranked words tend to be more stable while low-ranked words tend to be more volatile. Some words change in ranks in two ways: (a) by an accumulation of small increasing/decreasing rank changes in time and (b) by shocks of increase/decrease in ranks. Most of the stopwords and Swadesh words are observed to be stable in ranks across the eight languages. These signatures suggest unigram frequencies in all languages have changed in a manner inconsistent with a purely neutral evolutionary process.
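A minimal simulation sketch of the Wright-Fisher-inspired Markov chain the abstract describes: each year, N word tokens are resampled multinomially from the previous year's frequencies, with a floor of one count so no word disappears. The parameters and the flooring detail are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ranks(init_counts, years=100):
    """Neutral multinomial resampling of word counts; returns rank paths."""
    counts = np.asarray(init_counts, dtype=float)
    N = int(counts.sum())
    ranks = []
    for _ in range(years):
        counts = rng.multinomial(N, counts / counts.sum()).astype(float)
        counts = np.maximum(counts, 1.0)        # restriction: no disappearing words
        ranks.append((-counts).argsort().argsort() + 1)   # 1 = top rank
    return np.array(ranks)                      # shape: (years, vocab)

# e.g. simulate_ranks([10000, 2000, 500, 50, 5], years=10)
```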
Other Neural Networks|Deep Learning|Models|Modeling (1 paper)
【1】 Fine-Grained Causality Extraction From Natural Language Requirements Using Recursive Neural Tensor Networks
Authors: Jannik Fischbach, Tobias Springer, Julian Frattini, Henning Femmer, Andreas Vogelsang, Daniel Mendez Affiliations: Technical University of Munich; Blekinge Institute of Technology; University of Cologne Link: https://arxiv.org/abs/2107.09980 Abstract: [Context:] Causal relations (e.g., If A, then B) are prevalent in functional requirements. For various applications of AI4RE, e.g., the automatic derivation of suitable test cases from requirements, automatically extracting such causal statements is a basic necessity. [Problem:] We lack an approach that is able to extract causal relations from natural language requirements in fine-grained form. Specifically, existing approaches do not consider the combinatorics between causes and effects. They also do not allow splitting causes and effects into more granular text fragments (e.g., variable and condition), making the extracted relations unsuitable for automatic test case derivation. [Objective & Contributions:] We address this research gap and make the following contributions: First, we present the Causality Treebank, the first corpus of fully labeled binary parse trees representing the composition of 1,571 causal requirements. Second, we propose a fine-grained causality extractor based on Recursive Neural Tensor Networks. Our approach is capable of recovering the composition of causal statements written in natural language and achieves an F1 score of 74% in the evaluation on the Causality Treebank. Third, we disclose our open data sets as well as our code to foster the discourse on the automatic extraction of causality in the RE community.
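A minimal sketch of the Recursive Neural Tensor Network composition at the heart of this approach (the Socher et al., 2013 form): two child vectors a and b are merged into a parent via a bilinear tensor term plus a linear term. The dimension and initialization are assumptions, not the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class RNTNCompose(nn.Module):
    def __init__(self, d=50):
        super().__init__()
        # One (2d x 2d) tensor slice per output dimension.
        self.V = nn.Parameter(torch.randn(d, 2 * d, 2 * d) * 0.01)
        self.W = nn.Linear(2 * d, d)

    def forward(self, a, b):                 # a, b: child vectors of shape (d,)
        ab = torch.cat([a, b])               # (2d,)
        tensor_term = torch.einsum("j,djk,k->d", ab, self.V, ab)
        return torch.tanh(tensor_term + self.W(ab))   # parent vector, shape (d,)
```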
Other (5 papers)
【1】 Characterizing Social Imaginaries and Self-Disclosures of Dissonance in Online Conspiracy Discussion Communities
Authors: Shruti Phadke, Mattia Samory, Tanushree Mitra Note: Accepted at CSCW 2021 Link: https://arxiv.org/abs/2107.10204 Abstract: Online discussion platforms offer a forum to strengthen and propagate belief in misinformed conspiracy theories. Yet, they also offer avenues for conspiracy theorists to express their doubts and experiences of cognitive dissonance. Such expressions of dissonance may shed light on who abandons misguided beliefs and under which circumstances. This paper characterizes self-disclosures of dissonance about QAnon, a conspiracy theory initiated by a mysterious leader, Q, and popularized by their followers, anons, in conspiracy theory subreddits. To understand what dissonance and disbelief mean within conspiracy communities, we first characterize their social imaginaries: a broad understanding of how people collectively imagine their social existence. Focusing on 2K posts from two image boards, 4chan and 8chan, and 1.2M comments and posts from 12 subreddits dedicated to QAnon, we adopt a mixed-methods approach to uncover the symbolic language representing the movement, expectations, practices, heroes, and foes of the QAnon community. We use these social imaginaries to create a computational framework for distinguishing belief and dissonance from general discussion about QAnon. Further, analyzing user engagement with QAnon conspiracy subreddits, we find that self-disclosures of dissonance correlate with a significant decrease in user contributions and ultimately with their departure from the community. We contribute a computational framework for identifying dissonance self-disclosures and measuring the changes in user engagement surrounding dissonance. Our work can provide insights into designing dissonance-based interventions that can potentially dissuade conspiracists from online conspiracy discussion communities.
【2】 JEFL: Joint Embedding of Formal Proof Libraries
Authors: Qingxiang Wang, Cezary Kaliszyk Affiliations: University of Innsbruck, Austria; University of Warsaw, Poland Note: Submitted to FroCoS 2021 Link: https://arxiv.org/abs/2107.10188 Abstract: The heterogeneous nature of the logical foundations used in different interactive proof assistant libraries has rendered discovery of similar mathematical concepts among them difficult. In this paper, we compare a previously proposed algorithm for matching concepts across libraries with our unsupervised embedding approach that can help us retrieve similar concepts. Our approach is based on the fasttext implementation of Word2Vec, on top of which a tree traversal module is added to adapt the algorithm to the representation format of our data export pipeline. We compare the explainability, customizability, and online-servability of the approaches and argue that the neural embedding approach has more potential to be integrated into an interactive proof assistant.
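A minimal sketch of the general recipe the abstract outlines: traverse the exported term trees to linearize them into token sequences, then train a fasttext-style embedding on those sequences. Gensim stands in for the fasttext implementation, and the tree format, traversal order, and example terms are invented assumptions.

```python
from gensim.models import FastText

def preorder(node):
    """One possible tree traversal: yield symbols in pre-order."""
    yield node["symbol"]
    for child in node.get("children", []):
        yield from preorder(child)

# Hypothetical exported proof terms, e.g. "A -> B" as an Imp node.
proof_trees = [
    {"symbol": "Imp", "children": [{"symbol": "A"}, {"symbol": "B"}]},
    {"symbol": "And", "children": [{"symbol": "A"}, {"symbol": "A"}]},
]

corpus = [list(preorder(tree)) for tree in proof_trees]
model = FastText(sentences=corpus, vector_size=100, sg=1, min_count=1)
# model.wv.most_similar("Imp")  # retrieve concepts with similar embeddings
```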
【3】 CATE: CAusality Tree Extractor from Natural Language Requirements
Authors: Noah Jadallah, Jannik Fischbach, Julian Frattini, Andreas Vogelsang Affiliations: Technical University of Munich; Blekinge Institute of Technology; University of Cologne Link: https://arxiv.org/abs/2107.10023 Abstract: Causal relations (If A, then B) are prevalent in requirements artifacts. Automatically extracting causal relations from requirements holds great potential for various RE activities (e.g., the automatic derivation of suitable test cases). However, we lack an approach capable of extracting causal relations from natural language with reasonable performance. In this paper, we present our tool CATE (CAusality Tree Extractor), which is able to parse the composition of a causal relation as a tree structure. CATE not only provides an overview of causes and effects in a sentence but also reveals their semantic coherence by translating the causal relation into a binary tree. We encourage fellow researchers and practitioners to use CATE at https://causalitytreeextractor.com/
【4】 How Do Pedophiles Tweet? Investigating the Writing Styles and Online Personas of Child Cybersex Traffickers in the Philippines
Author: Joseph Marvin Imperial Affiliation: National University, Manila, Philippines Note: Submitted as a short paper to a conference Link: https://arxiv.org/abs/2107.09881 Abstract: One of the most important humanitarian responsibilities of every individual is to protect the future of our children. This entails not only protecting physical welfare but also preventing ill events that can affect the mental well-being of a child, such as sexual coercion and abuse, which, in worst-case scenarios, can result in lifelong trauma. In this study, we perform a preliminary investigation of how child sex peddlers spread illegal pornographic content and target minors for sexual activities on Twitter in the Philippines using Natural Language Processing techniques. The results of our study show the frequently used and co-occurring words that traffickers use to spread content, as well as four main roles played by these entities that contribute to the proliferation of child pornography in the country.
【5】 What Do You Get When You Cross Beam Search with Nucleus Sampling?
Authors: Uri Shaham, Omer Levy Affiliation: The Blavatnik School of Computer Science, Tel Aviv University Link: https://arxiv.org/abs/2107.09729 Abstract: We combine beam search with the probabilistic pruning technique of nucleus sampling to create two deterministic nucleus search algorithms for natural language generation. The first algorithm, p-exact search, locally prunes the next-token distribution and performs an exact search over the remaining space. The second algorithm, dynamic beam search, shrinks and expands the beam size according to the entropy of the candidate's probability distribution. Despite the probabilistic intuition behind nucleus search, experiments on machine translation and summarization benchmarks show that both algorithms reach the same performance levels as standard beam search.
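A minimal sketch of the two pruning ideas the abstract names: nucleus (top-p) pruning of the next-token distribution, and a beam size that grows with the entropy of that distribution. The nucleus rule is standard; the specific schedule mapping normalized entropy to a beam width is an assumption, not the paper's.

```python
import torch

def nucleus_prune(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative mass reaches p."""
    sorted_p, idx = probs.sort(descending=True)
    keep = sorted_p.cumsum(0) - sorted_p < p   # mass strictly before each token < p
    pruned = torch.zeros_like(probs)
    pruned[idx[keep]] = probs[idx[keep]]
    return pruned / pruned.sum()

def dynamic_beam_size(probs, k_min=1, k_max=20):
    """Wider beam when the next-token distribution is more uncertain."""
    entropy = -(probs * (probs + 1e-12).log()).sum()
    frac = entropy / torch.log(torch.tensor(float(probs.numel())))  # normalize to [0, 1]
    return int(k_min + frac.item() * (k_max - k_min))
```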