Natural Language Processing arXiv Digest [11.11]

2021-11-17 11:01:38

cs.CL: 13 papers today

Transformer (2 papers)

【1】 Pre-trained Transformer-Based Approach for Arabic Question Answering: A Comparative Study
Link: https://arxiv.org/abs/2111.05671

Authors: Kholoud Alsubhi, Amani Jamal, Areej Alhothali
Affiliation: King Abdulaziz University, Department of Computer Science, Jeddah, Kingdom of Saudi Arabia
Abstract: Question answering (QA) is one of the most challenging yet widely investigated problems in Natural Language Processing (NLP). QA systems try to produce answers for given questions; these answers can be generated from unstructured or structured text. Hence, QA is considered an important research area that can be used to evaluate text-understanding systems. A large volume of QA research has been devoted to English, investigating the most advanced techniques and achieving state-of-the-art results. However, work on Arabic question answering has progressed at a considerably slower pace, owing to the scarcity of prior research and the lack of large benchmark datasets. Recently, many pre-trained language models have delivered high performance on many Arabic NLP problems. In this work, we evaluate state-of-the-art pre-trained transformer models for Arabic QA on four reading comprehension datasets: Arabic-SQuAD, ARCD, AQAD, and TyDiQA-GoldP. We fine-tune and compare the performance of the AraBERTv2-base, AraBERTv0.2-large, and AraELECTRA models. Finally, we provide an analysis to understand and interpret the low performance of some of the models.
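
For readers who want the general shape of this kind of fine-tuning, here is a minimal sketch using the Hugging Face Transformers API. The checkpoint id, max length, and the toy sample are illustrative assumptions, not the authors' exact configuration:

```python
# Extractive-QA fine-tuning sketch; checkpoint id and sample are assumptions.
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

checkpoint = "aubmindlab/bert-base-arabertv2"  # assumed AraBERTv2-base id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

def preprocess(example):
    # Tokenize question + context, then map the character-level answer span
    # to start/end token indices, as is standard for SQuAD-style training.
    enc = tokenizer(example["question"], example["context"],
                    truncation="only_second", max_length=384,
                    return_offsets_mapping=True)
    start_char = example["answers"]["answer_start"][0]
    end_char = start_char + len(example["answers"]["text"][0])
    seq_ids = enc.sequence_ids()
    enc["start_positions"], enc["end_positions"] = 0, 0
    for i, (s, e) in enumerate(enc.pop("offset_mapping")):
        if seq_ids[i] != 1:            # skip special and question tokens
            continue
        if s <= start_char < e:
            enc["start_positions"] = i
        if s < end_char <= e:
            enc["end_positions"] = i
    return enc

sample = {"question": "Where is KAU located?",
          "context": "King Abdulaziz University is located in Jeddah.",
          "answers": {"text": ["Jeddah"], "answer_start": [40]}}
print(preprocess(sample)["start_positions"])
# A dataset mapped this way would then be passed to transformers.Trainer.
```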

【2】 CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval
Link: https://arxiv.org/abs/2111.05610

Authors: Zijian Gao, Jingyu Liu, Sheng Chen, Dedan Chang, Hao Zhang, Jinwei Yuan
Affiliation: OVBU, Tencent PCG
Note: Tech Report
Abstract: Modern video-text retrieval frameworks typically consist of three parts: a video encoder, a text encoder, and a similarity head. Following the success of visual and textual representation learning, transformer-based encoders and fusion methods have also been adopted for video-text retrieval. In this report we present CLIP2TV, which aims to explore where the critical elements lie in transformer-based methods. To this end, we first revisit some recent works on multi-modal learning, then introduce some of their techniques into video-text retrieval, and finally evaluate them through extensive experiments under different configurations. Notably, CLIP2TV achieves 52.9@R1 on the MSR-VTT dataset, outperforming the previous SOTA result by 4.1%.
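
As a reminder of what the reported 52.9@R1 metric measures, here is a minimal sketch of Recall@K computed from a query-by-candidate similarity matrix; the random matrix stands in for real model scores:

```python
import numpy as np

def recall_at_k(sim, k=1):
    """sim[i, j] = similarity between text query i and video j; the
    ground-truth match for query i is assumed to be video i (diagonal)."""
    ranks = (-sim).argsort(axis=1)     # candidates, best to worst, per query
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean() * 100           # reported as a percentage

sim = np.random.randn(1000, 1000)      # stand-in for similarity-head output
print(f"R@1 = {recall_at_k(sim, k=1):.1f}")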

BERT (1 paper)

【1】 BagBERT: BERT-based bagging-stacking for multi-topic classification
Link: https://arxiv.org/abs/2111.05808

Authors: Loïc Rakotoson, Charles Letaillieur, Sylvain Massip, Fréjus Laleye
Affiliation: Opscidia, Paris, France
Abstract: This paper describes our submission to the COVID-19 literature annotation task at BioCreative VII. We propose an approach that exploits the knowledge in globally non-optimal weights, which are usually discarded, to build a rich representation of each label. The approach consists of two stages: (1) bagging over various initializations of the training data, yielding weakly trained weights, and (2) stacking of heterogeneous-vocabulary models based on BERT and RoBERTa embeddings. The aggregation of these weak insights performs better than a classical, globally efficient model. The goal is to distill this rich knowledge into a simpler, lighter model. Our system obtains an instance-based F1 of 92.96 and a label-based micro-F1 of 91.35.
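
The bagging half of the idea can be pictured as a soft vote over the label probabilities produced by several weakly trained initializations; a second stacking stage would combine BERT- and RoBERTa-based members. A toy sketch (shapes and threshold are made up):

```python
import numpy as np

def bag_predictions(probs_per_model, threshold=0.5):
    """probs_per_model: list of (n_samples, n_labels) arrays, one per weakly
    trained initialization in the bag. Soft-vote, then threshold per label."""
    avg = np.mean(probs_per_model, axis=0)
    return (avg >= threshold).astype(int)   # multi-label topic decisions

bag = [np.random.rand(4, 7) for _ in range(5)]   # 5 inits, 7 topic labels
print(bag_predictions(bag))
```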

QA | VQA | Question Answering | Dialogue (1 paper)

【1】 A Two-Stage Approach towards Generalization in Knowledge Base Question Answering
Link: https://arxiv.org/abs/2111.05825

Authors: Srinivas Ravishankar, June Thai, Ibrahim Abdelaziz, Nandana Mihidukulasooriya, Tahira Naseem, Pavan Kapanipathi, Gaetano Rossilleo, Achille Fokoue
Affiliations: IBM Research; UMass Amherst
Abstract: Most existing approaches for Knowledge Base Question Answering (KBQA) focus on a specific underlying knowledge base, either because of inherent assumptions in the approach or because evaluating on a different knowledge base requires non-trivial changes. However, many popular knowledge bases share similarities in their underlying schemas that can be leveraged to facilitate generalization across knowledge bases. To achieve this generalization, we introduce a KBQA framework based on a two-stage architecture that explicitly separates semantic parsing from knowledge base interaction, facilitating transfer learning across datasets and knowledge graphs. We show that pretraining on datasets with a different underlying knowledge base can nevertheless provide significant performance gains and reduce sample complexity. Our approach achieves comparable or state-of-the-art performance on LC-QuAD (DBpedia), WebQSP (Freebase), SimpleQuestions (Wikidata), and MetaQA (Wikimovies-KG).
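
The two-stage split can be pictured as: a semantic parser produces a logical form, and a separate, swappable module grounds and executes it against a specific knowledge base. The sketch below stubs the parser with a hard-coded query and runs it against the public DBpedia endpoint; the question, query, and property names are illustrative assumptions and may return no rows depending on the KB contents:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Stage 1 (stubbed): a trained parser would map the question to this form.
question = "Who founded the Dutch East India Company?"   # hypothetical input
logical_form = """
SELECT ?founder WHERE {
  dbr:Dutch_East_India_Company dbo:foundedBy ?founder .
}"""

# Stage 2: KB interaction is isolated here, so swapping DBpedia for another
# endpoint (Freebase dump, Wikidata, ...) does not touch the parser.
endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("PREFIX dbr: <http://dbpedia.org/resource/> "
                  "PREFIX dbo: <http://dbpedia.org/ontology/> " + logical_form)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["founder"]["value"])
```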

Semantic Analysis (1 paper)

【1】 MNet-Sim: A Multi-layered Semantic Similarity Network to Evaluate Sentence Similarity
Link: https://arxiv.org/abs/2111.05412

Authors: Manuela Nayantara Jeyaraj, Dharshana Kasthurirathna
Affiliation: Sri Lanka Institute of Information Technology, Malabe, Sri Lanka
Abstract: Similarity is a comparative, subjective measure that varies with the domain in which it is considered. In NLP applications such as document classification, pattern recognition, chatbot question answering, and sentiment analysis, identifying an accurate similarity score for sentence pairs has become a crucial area of research. Existing models that assess similarity are limited in how effectively they compute similarity from contextual comparisons, by localization effects due to centering theory, and by the lack of non-semantic textual comparisons. Hence, this paper presents a multi-layered semantic similarity network model built upon multiple similarity measures, which renders an overall sentence similarity score based on principles from network science, neighboring weighted relational edges, and a proposed extended node-similarity computation formula. The proposed multi-layered network model was evaluated against established state-of-the-art models and demonstrated better performance in assessing sentence similarity.
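
To make "multiple similarity measures combined into one score" concrete, here is a toy two-layer version. MNet-Sim's actual aggregation is network-based; the weighted sum and the weights below are simplifying assumptions for illustration only:

```python
from difflib import SequenceMatcher

def jaccard(a, b):
    """Layer 1: word-overlap (semantic-ish) similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def char_ratio(a, b):
    """Layer 2: character-level (non-semantic) similarity."""
    return SequenceMatcher(None, a, b).ratio()

def overall_similarity(a, b, weights=(0.6, 0.4)):
    layers = (jaccard(a, b), char_ratio(a, b))
    return sum(w * s for w, s in zip(weights, layers))

print(overall_similarity("The cat sat on the mat", "A cat was on the mat"))
```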

Graph | Knowledge Graph | Knowledge (1 paper)

【1】 The Wind in Our Sails: Developing a Reusable and Maintainable Dutch Maritime History Knowledge Graph
Link: https://arxiv.org/abs/2111.05605

Authors: Stijn Schouten, Victor de Boer, Lodewijk Petram, Marieke van Erp
Affiliations: Vrije Universiteit Amsterdam, Department of Computer Science, Amsterdam, The Netherlands; Huygens Instituut, Amsterdam, The Netherlands; KNAW Humanities Cluster, DHLab
Note: Accepted to K-CAP '21, December 2-3, 2021, Virtual Event, USA
Abstract: Digital sources are more prevalent than ever, but using them effectively can be challenging. One core challenge is that digitized sources are often distributed, forcing researchers to spend time collecting, interpreting, and aligning different sources. A knowledge graph can accelerate research by providing a single connected source of truth that humans and machines can query. Over two design-test cycles, we convert four datasets from the historical maritime domain into a knowledge graph. The focus during these cycles is on creating a sustainable and usable approach that can be adopted in other linked-data conversion efforts. Furthermore, our knowledge graph is available for maritime historians and other interested users to investigate the daily business of the Dutch East India Company through a unified portal.
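
Linked-data conversion of the kind described usually means emitting RDF triples from tabular records. A minimal sketch with rdflib follows; the namespace, vocabulary, and voyage record are hypothetical, not the project's actual schema or data:

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import XSD

EX = Namespace("https://example.org/maritime/")   # placeholder namespace
g = Graph()
g.bind("ex", EX)

# One made-up voyage record converted into triples.
voyage = URIRef(EX["voyage/1"])
g.add((voyage, RDF.type, EX.Voyage))
g.add((voyage, EX.shipName, Literal("Amsterdam")))
g.add((voyage, EX.departureDate, Literal("1749-01-08", datatype=XSD.date)))

print(g.serialize(format="turtle"))   # ready for a triple store / SPARQL portal
```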

Reasoning | Analysis | Understanding | Interpretation (3 papers)

【1】 Understanding COVID-19 Vaccine Reaction through Comparative Analysis on Twitter
Link: https://arxiv.org/abs/2111.05823

Authors: Yuesheng Luo, Mayank Kejriwal
Affiliation: Information Sciences Institute, University of Southern California
Note: 20 pages, accepted at the 2022 Computing Conference
Abstract: Although multiple COVID-19 vaccines have been available for several months now, vaccine hesitancy continues to be at high levels in the United States. In part, the issue has also become politicized, especially since the presidential election in November. Understanding vaccine hesitancy during this period in the context of social media, including Twitter, can provide valuable guidance to both computational social scientists and policy makers. Rather than studying a single Twitter corpus, this paper takes a novel view of the problem by comparatively studying two Twitter datasets collected over two different periods (one before the election, the other a few months after) using the same carefully controlled data collection and filtering methodology. Our results show a significant shift in discussion from politics to COVID-19 vaccines between fall 2020 and spring 2021. Using clustering and machine-learning-based methods in conjunction with sampling and qualitative analysis, we uncover several fine-grained reasons for vaccine hesitancy, some of which have become more (or less) important over time. Our results also underscore the intense polarization and politicization of this issue over the last year.
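
A typical starting point for this kind of clustering-plus-inspection workflow is TF-IDF vectors fed to k-means, then reading off the top terms per cluster. The sketch below uses scikit-learn with stand-in tweets; the paper's corpora, cluster count, and exact pipeline are not reproduced here:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [  # stand-in texts, not the study's data
    "got my second dose today, feeling fine",
    "not taking a vaccine rushed through trials",
    "mandates are government overreach",
    "waiting for more long-term safety data",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(tweets)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for c in range(2):   # top terms per cluster, for qualitative inspection
    top = km.cluster_centers_[c].argsort()[::-1][:3]
    print(c, [terms[i] for i in top])
```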

【2】 Cross-lingual Adaption Model-Agnostic Meta-Learning for Natural Language Understanding
Link: https://arxiv.org/abs/2111.05805

Authors: Qianying Liu, Fei Cheng, Sadao Kurohashi
Affiliation: Graduate School of Informatics, Kyoto University
Note: 11 pages
Abstract: Meta-learning with auxiliary languages has demonstrated promising improvements for cross-lingual natural language processing. However, previous studies sample the meta-training and meta-testing data from the same language, which limits the model's ability to transfer across languages. In this paper, we propose XLA-MAML, which performs direct cross-lingual adaptation in the meta-learning stage. We conduct zero-shot and few-shot experiments on Natural Language Inference and Question Answering. The experimental results demonstrate the effectiveness of our method across different languages, tasks, and pretrained models. We also analyze various cross-lingual-specific settings for meta-learning, including sampling strategy and parallelism.
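
The MAML template behind this can be sketched in a few lines of PyTorch: adapt on a support batch in the inner loop, then meta-update on a query batch, where XLA-MAML's twist is drawing support and query from different languages. The toy head and random data below are illustrative assumptions, not the paper's model:

```python
import torch
import torch.nn.functional as F

# Toy classifier head standing in for a multilingual encoder: 768-d sentence
# features, 3 NLI classes.
w = torch.randn(3, 768, requires_grad=True)
b = torch.zeros(3, requires_grad=True)
meta_opt = torch.optim.Adam([w, b], lr=1e-3)
inner_lr = 0.01

def episode(sx, sy, qx, qy):
    # Inner loop: one gradient step of adaptation on the support language.
    s_loss = F.cross_entropy(F.linear(sx, w, b), sy)
    gw, gb = torch.autograd.grad(s_loss, (w, b), create_graph=True)
    w_adapted, b_adapted = w - inner_lr * gw, b - inner_lr * gb
    # Outer objective: adapted parameters evaluated on the query language.
    return F.cross_entropy(F.linear(qx, w_adapted, b_adapted), qy)

for step in range(100):
    sx, sy = torch.randn(8, 768), torch.randint(0, 3, (8,))  # e.g. language A
    qx, qy = torch.randn(8, 768), torch.randint(0, 3, (8,))  # e.g. language B
    meta_opt.zero_grad()
    episode(sx, sy, qx, qy).backward()
    meta_opt.step()
```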

【3】 Towards Tractable Mathematical Reasoning: Challenges, Strategies, and Opportunities for Solving Math Word Problems
Link: https://arxiv.org/abs/2111.05364

Authors: Keyur Faldu, Amit Sheth, Prashant Kikani, Manas Gaur, Aditi Avasthi
Affiliations: Embibe; University of South Carolina
Note: 15 pages, 2 tables, 4 figures
Abstract: Mathematical reasoning is likely to be one of the next frontiers where artificial intelligence makes significant progress. The ongoing surge of work on solving math word problems (MWPs), and thereby achieving better mathematical reasoning, will continue to be a key line of research in the coming years. We examine non-neural and neural methods for solving math word problems narrated in natural language, and highlight the extent to which these methods are generalizable, mathematically sound, interpretable, and explainable. Neural approaches dominate the current state of the art, and we survey them while highlighting three strategies for MWP solving: (1) direct answer generation, (2) expression-tree generation for inferring answers, and (3) template retrieval for answer computation. Moreover, we discuss technological approaches, review the evolution of intuitive design choices for solving MWPs, and examine them for mathematical reasoning ability. We finally identify several gaps that warrant external knowledge and knowledge-infused learning, among several other opportunities in solving MWPs.
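
Strategy (2), expression-tree generation, means the model emits an arithmetic expression whose tree is then evaluated to obtain the answer. A minimal sketch, with Python's ast module standing in for a decoder's output and a made-up word problem:

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def eval_tree(node):
    """Recursively evaluate a parsed arithmetic expression tree."""
    if isinstance(node, ast.Expression):
        return eval_tree(node.body)
    if isinstance(node, ast.BinOp):
        return OPS[type(node.op)](eval_tree(node.left), eval_tree(node.right))
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError(f"unsupported node: {node!r}")

# "Tom buys 3 boxes of 12 apples and gives away 5. How many are left?"
expression = "3 * 12 - 5"                              # hypothetical output
print(eval_tree(ast.parse(expression, mode="eval")))   # -> 31
```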

Recognition / Classification (1 paper)

【1】 Important Sentence Identification in Legal Cases Using Multi-Class Classification
Link: https://arxiv.org/abs/2111.05721

Authors: Sahan Jayasinghe, Lakith Rambukkanage, Ashan Silva, Nisansa de Silva, Amal Shehan Perera
Affiliation: Department of Computer Science & Engineering, University of Moratuwa
Abstract: Advances in Natural Language Processing (NLP) are spreading through various domains in the form of practical applications and academic interest. The legal domain inherently contains a vast amount of data in text format, so it requires NLP to cater to its analytically demanding needs. Identifying the important sentences, facts, and arguments in a legal case is a tedious task for legal professionals. In this research, we explore the use of sentence embeddings for multi-class classification to identify the important sentences in a legal case, from the perspective of the main parties in the case. In addition, a task-specific loss function is defined to improve on the accuracy attainable with straightforward categorical cross-entropy loss.
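
The abstract does not give the loss formula, but one plausible form of such a task-specific loss is class-weighted cross entropy that up-weights rare sentence classes. The counts and weights below are made-up illustrations, not the paper's values:

```python
import torch
import torch.nn.functional as F

# Hypothetical imbalanced class frequencies; inverse-frequency weighting.
class_counts = torch.tensor([900.0, 60.0, 40.0])
weights = class_counts.sum() / (len(class_counts) * class_counts)

logits = torch.randn(4, 3)        # classifier output over sentence embeddings
labels = torch.tensor([0, 1, 2, 1])
loss = F.cross_entropy(logits, labels, weight=weights)
print(loss.item())
```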

Other Neural Networks | Deep Learning | Models | Modeling (2 papers)

【1】 Prune Once for All: Sparse Pre-Trained Language Models
Link: https://arxiv.org/abs/2111.05754

Authors: Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, Moshe Wasserblat
Affiliation: Intel Labs, Israel; Intel Corporation
Note: ENLSP NeurIPS Workshop 2021, 12 pages
Abstract: Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase the implementation efficiency of large transformer-based models on target hardware. In this work we present a new method for training sparse pre-trained transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used for transfer learning on a wide range of tasks while maintaining their sparsity pattern. We demonstrate our method on three known architectures to create sparse pre-trained BERT-Base, BERT-Large, and DistilBERT. We show how the compressed sparse pre-trained models transfer their knowledge to five downstream natural language tasks with minimal accuracy loss. Moreover, we show how to further compress the sparse models' weights to 8-bit precision using quantization-aware training. For example, with our sparse pre-trained BERT-Large fine-tuned on SQuADv1.1 and quantized to 8-bit, we achieve a compression ratio of $40\times$ for the encoder with less than $1\%$ accuracy loss. To the best of our knowledge, our results show the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.
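
The weight-pruning building block here is magnitude pruning: zero the smallest-magnitude weights and keep a binary mask so the sparsity pattern survives later fine-tuning. A minimal sketch (the 90% sparsity level and tensor shape are illustrative, not the paper's settings):

```python
import torch

def magnitude_prune(weight, sparsity=0.9):
    """Zero the smallest-magnitude entries; return the pruned weight and the
    binary mask, which must be re-applied after every optimizer step during
    downstream fine-tuning to preserve the sparsity pattern."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).float()
    return weight * mask, mask

w = torch.randn(768, 768)                      # stand-in transformer weight
w_sparse, mask = magnitude_prune(w, sparsity=0.9)
print(f"actual sparsity: {(w_sparse == 0).float().mean():.2%}")
```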

【2】 Learning Logic Rules for Document-level Relation Extraction
Link: https://arxiv.org/abs/2111.05407

Authors: Dongyu Ru, Changzhi Sun, Jiangtao Feng, Lin Qiu, Hao Zhou, Weinan Zhang, Yong Yu, Lei Li
Affiliations: Shanghai Jiao Tong University; ByteDance AI Lab; University of California, Santa Barbara
Note: To appear at the EMNLP 2021 main conference
Abstract: Document-level relation extraction aims to identify relations between entities across a whole document. Prior efforts to capture long-range dependencies have relied heavily on implicitly powerful representations learned through (graph) neural networks, which makes the models less transparent. To tackle this challenge, we propose LogiRE, a novel probabilistic model for document-level relation extraction that learns logic rules. LogiRE treats logic rules as latent variables and consists of two modules: a rule generator and a relation extractor. The rule generator produces logic rules that potentially contribute to the final predictions, and the relation extractor outputs the final predictions based on the generated rules. The two modules can be optimized efficiently with the expectation-maximization (EM) algorithm. By introducing logic rules into neural networks, LogiRE can explicitly capture long-range dependencies and enjoys better interpretability. Empirical results show that LogiRE significantly outperforms several strong baselines in terms of relation performance (by 1.8 F1 score) and logical consistency (by over 3.3 logic score). Our code is available at https://github.com/rudongyu/LogiRE.
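
For readers unfamiliar with EM over latent variables, here is the alternating E/M pattern on a deliberately tiny problem, a two-coin mixture, where the unobserved coin identity plays the role that latent logic rules play in LogiRE. This illustrates only the optimization scheme, not the paper's model; uniform mixing proportions are assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
flips = 10
heads = rng.binomial(flips, [0.8] * 30 + [0.3] * 30)  # 60 observed sequences
theta = np.array([0.6, 0.4])                          # initial coin biases

for _ in range(50):
    # E-step: posterior responsibility of each latent coin for each sequence
    # (binomial coefficients cancel in the normalization).
    like = np.array([t ** heads * (1 - t) ** (flips - heads) for t in theta])
    resp = like / like.sum(axis=0)
    # M-step: re-estimate biases from responsibility-weighted head counts.
    theta = (resp * heads).sum(axis=1) / (resp * flips).sum(axis=1)

print(theta)   # converges near the true biases (~0.8 and ~0.3)
```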

Other (1 paper)

【1】 A Computational Approach to Walt Whitman's Stylistic Changes in Leaves of Grass
Link: https://arxiv.org/abs/2111.05414

Author: Jieyan Zhu
Affiliation: Columbia University
Note: 22 pages, 3 figures, 7 tables
Abstract: This study analyzes Walt Whitman's stylistic changes in his landmark work Leaves of Grass from a computational perspective and relates the findings to standard literary criticism on Whitman. The corpus consists of all seven editions of Leaves of Grass, from the earliest 1855 edition to the 1891-92 "deathbed" edition. Starting from word-frequency counts, the simplest stylometric technique, we find consistent shifts in word choice. Macro-etymological analysis reveals Whitman's increasing preference for words of specific origins, which correlates with the increasing lexical complexity of Leaves of Grass. Principal component analysis, an unsupervised learning algorithm, reduces the dimensionality of the tf-idf vectors to two dimensions, providing a straightforward view of the stylistic changes. Finally, sentiment analysis shows the evolution of Whitman's emotional state throughout his writing career.
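
The tf-idf-plus-PCA step is easy to reproduce with scikit-learn. The snippets below are stand-ins for the full edition texts, so the printed coordinates are illustrative only:

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

editions = {   # stand-in snippets; the study uses the full text per edition
    "1855": "I celebrate myself and what I assume you shall assume",
    "1867": "I celebrate myself and sing myself and what I assume",
    "1891-92": "I celebrate myself and sing myself",
}
X = TfidfVectorizer().fit_transform(editions.values())
coords = PCA(n_components=2).fit_transform(X.toarray())
for name, (x, y) in zip(editions, coords):
    print(f"{name}: ({x:+.3f}, {y:+.3f})")   # each edition as a 2-D point
```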
