cs.CL: 15 papers today
QA|VQA|Question Answering|Dialogue (1 paper)
【1】 Speaker and Time-aware Joint Contextual Learning for Dialogue-act Classification in Counselling Conversations
Link: https://arxiv.org/abs/2111.06647
Authors: Ganeshan Malhotra, Abdul Waheed, Aseem Srivastava, Md Shad Akhtar, Tanmoy Chakraborty
Affiliations: BITS Pilani, Goa, India; Maharaja Agrasen Institute of Technology, New Delhi, India; IIIT-Delhi, India
Note: 9 pages; Accepted to WSDM 2022
Abstract: The onset of the COVID-19 pandemic has put people's mental health at risk. Social counselling has gained remarkable significance in this environment. Unlike general goal-oriented dialogues, a conversation between a patient and a therapist is considerably implicit, though the objective of the conversation is quite apparent. In such a case, understanding the intent of the patient is imperative in providing effective counselling in therapy sessions, and the same applies to a dialogue system as well. In this work, we take a small but important step in the development of an automated dialogue system for mental-health counselling. We develop a novel dataset, named HOPE, to provide a platform for dialogue-act classification in counselling conversations. We identify the requirements of such conversations and propose twelve domain-specific dialogue-act (DAC) labels. We collect 12.9K utterances from publicly available counselling session videos on YouTube, extract their transcripts, clean them, and annotate them with DAC labels. Further, we propose SPARTA, a transformer-based architecture with novel speaker- and time-aware contextual learning for dialogue-act classification. Our evaluation shows convincing performance over several baselines, achieving state-of-the-art results on HOPE. We also supplement our experiments with extensive empirical and qualitative analyses of SPARTA.
Machine Translation (1 paper)
【1】 BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation
Link: https://arxiv.org/abs/2111.06787
Authors: Eleftheria Briakou, Sida I. Wang, Luke Zettlemoyer, Marjan Ghazvininejad
Affiliations: University of Maryland; Facebook AI Research
Abstract: Mined bitexts can contain imperfect translations that yield unreliable training signals for Neural Machine Translation (NMT). While filtering such pairs out is known to improve final model quality, we argue that it is suboptimal in low-resource conditions where even mined data can be limited. In our work, we propose instead to refine the mined bitexts via automatic editing: given a sentence in a language xf, and a possibly imperfect translation of it xe, our model generates a revised version xf' or xe' that yields a more equivalent translation pair (i.e., <xf, xe'> or <xf', xe>).
Summarization|Information Extraction (1 paper)
【1】 AnswerSumm: A Manually-Curated Dataset and Pipeline for Answer Summarization
Link: https://arxiv.org/abs/2111.06474
Authors: Alexander R. Fabbri, Xiaojian Wu, Srini Iyer, Haoran Li, Mona Diab
Affiliations: Yale University; Facebook AI
Note: arXiv admin note: substantial text overlap with arXiv:2104.08536
Abstract: Community Question Answering (CQA) fora such as Stack Overflow and Yahoo! Answers contain a rich resource of answers to a wide range of community-based questions. Each question thread can receive a large number of answers with different perspectives. One goal of answer summarization is to produce a summary that reflects the range of answer perspectives. A major obstacle for abstractive answer summarization is the absence of a dataset to provide supervision for producing such summaries. Recent works propose heuristics to create such data, but these are often noisy and do not cover all perspectives present in the answers. This work introduces a novel dataset of 4,631 CQA threads for answer summarization, curated by professional linguists. Our pipeline gathers annotations for all subtasks involved in answer summarization, including the selection of answer sentences relevant to the question, grouping these sentences based on perspectives, summarizing each perspective, and producing an overall summary. We analyze and benchmark state-of-the-art models on these subtasks and introduce a novel unsupervised approach for multi-perspective data augmentation that further boosts overall summarization performance according to automatic evaluation. Finally, we propose reinforcement learning rewards to improve factual consistency and answer coverage, and analyze areas for improvement.
Reasoning|Analysis|Understanding|Interpretation (2 papers)
【1】 On Transferability of Prompt Tuning for Natural Language Understanding
Link: https://arxiv.org/abs/2111.06719
Authors: Yusheng Su, Xiaozhi Wang, Yujia Qin, Chi-Min Chan, Yankai Lin, Zhiyuan Liu, Peng Li, Juanzi Li, Lei Hou, Maosong Sun, Jie Zhou
Affiliations: Department of Computer Science and Technology, Tsinghua University, Beijing, China; Pattern Recognition Center, WeChat AI, Tencent Inc.
Abstract: Prompt tuning (PT) is a promising parameter-efficient method for utilizing extremely large pre-trained language models (PLMs), which can achieve performance comparable to full-parameter fine-tuning by tuning only a few soft prompts. However, compared to fine-tuning, PT empirically requires many more training steps. To explore whether we can improve the efficiency of PT by reusing trained soft prompts and sharing learned knowledge, we empirically investigate the transferability of soft prompts across different tasks and models. In cross-task transfer, we find that trained soft prompts transfer well to similar tasks and can initialize PT for them to accelerate training and improve performance. Moreover, to explore what factors influence prompts' transferability across tasks, we investigate how to measure prompt similarity and find that the overlapping rate of activated neurons highly correlates with transferability. In cross-model transfer, we explore how to project the prompts of one PLM to another PLM and successfully train a projector that achieves non-trivial transfer performance on similar tasks. However, initializing PT with the projected prompts does not work well, which may be caused by optimization preferences and PLMs' high redundancy. Our findings show that improving PT with knowledge transfer is possible and promising, while prompts' cross-task transferability is generally better than their cross-model transferability.
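The neuron-overlap similarity measure mentioned in the abstract can be illustrated with a small sketch. Note the thresholding rule and the Jaccard-style ratio below are illustrative assumptions, not the paper's exact formulation:

```python
def activated_neurons(activations, threshold=0.0):
    """Indices of neurons whose activation for a given prompt exceeds a threshold."""
    return {i for i, a in enumerate(activations) if a > threshold}

def overlap_rate(act_a, act_b, threshold=0.0):
    """Jaccard-style overlap of the neuron sets activated by two prompts.

    The intuition from the paper: prompts for similar tasks tend to activate
    largely the same neurons, and higher overlap correlates with better transfer.
    """
    a = activated_neurons(act_a, threshold)
    b = activated_neurons(act_b, threshold)
    if not a and not b:
        return 1.0  # neither prompt activates anything: treat as identical
    return len(a & b) / len(a | b)

# Three neurons fire in total across the two prompts, two are shared -> 2/3.
print(overlap_rate([0.9, 0.0, 0.4, 0.7], [0.8, 0.0, 0.5, 0.0]))
```

A high overlap rate would then suggest the trained prompt of one task is a good initialization for the other.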
【2】 On-the-Fly Rectification for Robust Large-Vocabulary Topic Inference
Link: https://arxiv.org/abs/2111.06580
Authors: Moontae Lee, Sungjun Cho, Kun Dong, David Mimno, David Bindel
Affiliations: Cornell University
Abstract: Across many data domains, co-occurrence statistics about the joint appearance of objects are powerfully informative. By transforming unsupervised learning problems into decompositions of co-occurrence statistics, spectral algorithms provide transparent and efficient algorithms for posterior inference such as latent topic analysis and community detection. As object vocabularies grow, however, it becomes rapidly more expensive to store and run inference algorithms on co-occurrence statistics. Rectifying co-occurrence, the key process for upholding model assumptions, becomes increasingly vital in the presence of rare terms, but current techniques cannot scale to large vocabularies. We propose novel methods that simultaneously compress and rectify co-occurrence statistics, scaling gracefully with the size of the vocabulary and the dimension of the latent space. We also present new algorithms for learning latent variables from the compressed statistics, and verify that our methods perform comparably to previous approaches on both textual and non-textual data.
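As a minimal illustration of the co-occurrence statistics these spectral methods start from, the sketch below counts joint appearances of word pairs within documents. It is a simplified sketch (presence-based counts, no normalization); the paper's compression and rectification steps are not reproduced here:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs):
    """Count joint document-level appearances of word pairs.

    Each document contributes 1 per unordered pair of distinct words it
    contains (presence, not token multiplicity). Spectral topic methods
    typically normalize such counts into a joint-probability matrix and
    then rectify it to satisfy model assumptions.
    """
    counts = Counter()
    for doc in docs:
        for w1, w2 in combinations(sorted(set(doc)), 2):
            counts[(w1, w2)] += 1
    return counts

docs = [["topic", "model", "word"], ["topic", "word"], ["graph", "model"]]
c = cooccurrence_counts(docs)
print(c[("topic", "word")])  # 2: the pair co-occurs in the first two documents
```

The quadratic growth of this pair table with vocabulary size is exactly the scaling problem the paper targets.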
Word2Vec|Text|Words (2 papers)
【1】 RATE: Overcoming Noise and Sparsity of Textual Features in Real-Time Location Estimation
Link: https://arxiv.org/abs/2111.06515
Authors: Yu Zhang, Wei Wei, Binxuan Huang, Kathleen M. Carley, Yan Zhang
Affiliations: Key Laboratory of Machine Perception (MOE), Peking University, Beijing, China; School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
Note: 4 pages; Accepted to CIKM 2017; Some typos fixed
Abstract: Real-time location inference of social media users is fundamental to spatial applications such as localized search and event detection. While tweet text is the most commonly used feature in location estimation, most prior works suffer from either the noise or the sparsity of textual features. In this paper, we aim to tackle these two problems. We use topic modeling as a building block to characterize geographic topic variation and lexical variation, so that "one-hot" encoding vectors are no longer used directly. We also incorporate other features which can be extracted through the Twitter streaming API to overcome the noise problem. Experimental results show that our RATE algorithm outperforms several benchmark methods, both in the precision of region classification and in the mean distance error of latitude and longitude regression.
【2】 SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets
Link: https://arxiv.org/abs/2111.06467
Authors: Ann Yuan, Daphne Ippolito, Vitaly Nikolaev, Chris Callison-Burch, Andy Coenen, Sebastian Gehrmann
Affiliations: Google Research; University of Pennsylvania
Note: 10 pages, 2 figures, accepted to NeurIPS 2021 Datasets and Benchmarks Track
Abstract: NLP researchers need more, higher-quality text datasets. Human-labeled datasets are expensive to collect, while datasets collected via automatic retrieval from the web, such as WikiBio, are noisy and can include undesired biases. Moreover, data sourced from the web is often included in datasets used to pretrain models, leading to inadvertent cross-contamination of training and test sets. In this work we introduce a novel method for efficient dataset curation: we use a large language model to provide seed generations to human raters, thereby changing dataset authoring from a writing task to an editing task. We use our method to curate SynthBio, a new evaluation set for WikiBio, composed of structured attribute lists describing fictional individuals, mapped to natural language biographies. We show that our dataset of fictional biographies is less noisy than WikiBio, and also more balanced with respect to gender and nationality.
Other Neural Networks|Deep Learning|Models|Modeling (3 papers)
【1】 Extraction of Medication Names from Twitter Using Augmentation and an Ensemble of Language Models
Link: https://arxiv.org/abs/2111.06664
Authors: Igor Kulev, Berkay Köprü, Raul Rodriguez-Esteban, Diego Saldana, Yi Huang, Alessandro La Torraca, Elif Ozkirimli
Affiliations: Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Switzerland; Personalized Healthcare Center of Excellence, F. Hoffmann-La Roche Ltd, Basel, Switzerland
Note: Proceedings of the BioCreative VII Challenge Evaluation Workshop
Abstract: The BioCreative VII Track 3 challenge focused on the identification of medication names in Twitter user timelines. For our submission to this challenge, we expanded the available training data by using several data augmentation techniques. The augmented data was then used to fine-tune an ensemble of language models that had been pre-trained on general-domain Twitter content. The proposed approach outperformed the prior state-of-the-art algorithm Kusuri and ranked high in the competition for our selected objective function, the overlapping F1 score.
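Assuming "overlapping F1" means that a predicted mention counts as a hit if its span overlaps any gold span (an interpretation of the metric name, not a formulation taken from the paper), the metric can be sketched as:

```python
def spans_overlap(a, b):
    """True if two half-open character spans (start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def overlapping_f1(pred, gold):
    """F1 where any span overlap counts as a match (assumed metric definition).

    Precision: fraction of predicted spans overlapping some gold span.
    Recall: fraction of gold spans overlapping some predicted span.
    """
    tp_pred = sum(any(spans_overlap(p, g) for g in gold) for p in pred)
    tp_gold = sum(any(spans_overlap(g, p) for p in pred) for g in gold)
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One of two predictions overlaps one of two gold mentions -> P = R = F1 = 0.5.
print(overlapping_f1([(0, 5), (10, 14)], [(2, 6), (20, 25)]))  # 0.5
```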
【2】 A Convolutional Neural Network Based Approach to Recognize Bangla Spoken Digits from Speech Signal
Link: https://arxiv.org/abs/2111.06625
Authors: Ovishake Sen, Al-Mahmud, Pias Roy
Affiliations: Computer Science and Engineering, Khulna University of Engineering & Technology, Khulna, Bangladesh
Note: 4 pages, 5 figures, 2021 International Conference on Electronics, Communications and Information Technology (ICECIT), 14 to 16 September 2021, Khulna, Bangladesh
Abstract: Speech recognition is a technique that converts human speech signals into text or words, or into any form that can be easily understood by computers or other machines. There have been a few studies on Bangla digit recognition systems, the majority of which used small datasets with little variation in gender, age, dialect, and other variables. In this study, audio recordings of Bangladeshi people of various genders, ages, and dialects were used to create a large speech dataset of spoken '0-9' Bangla digits. Here, 400 noisy and noise-free samples per digit have been recorded for creating the dataset. Mel Frequency Cepstral Coefficients (MFCCs) have been utilized for extracting meaningful features from the raw speech data. Then, to detect Bangla numeral digits, Convolutional Neural Networks (CNNs) were utilized. The suggested technique recognizes '0-9' Bangla spoken digits with 97.1% accuracy across the whole dataset. The efficiency of the model was also assessed using 10-fold cross-validation, which yielded 96.7% accuracy.
【3】 Benchmarking deep generative models for diverse antibody sequence design
Link: https://arxiv.org/abs/2111.06801
Authors: Igor Melnyk, Payel Das, Vijil Chenthamarakshan, Aurelie Lozano
Affiliations: IBM Research AI
Note: Learning Meaningful Representations of Life Workshop paper at NeurIPS 2021
Abstract: Computational protein design, i.e. inferring novel and diverse protein sequences consistent with a given structure, remains a major unsolved challenge. Recently, deep generative models that learn from sequences alone, or from sequences and structures jointly, have shown impressive performance on this task. However, those models appear limited in terms of modeling structural constraints, capturing enough sequence diversity, or both. Here we consider three recently proposed deep generative frameworks for protein design: (AR) a sequence-based autoregressive generative model, (GVP) a precise structure-based graph neural network, and Fold2Seq, which leverages a fuzzy and scale-free representation of a three-dimensional fold while enforcing structure-to-sequence (and vice versa) consistency. We benchmark these models on the task of computational design of antibody sequences, which demands designing sequences with high diversity for functional implication. The Fold2Seq framework outperforms the two other baselines in terms of diversity of the designed sequences, while maintaining the typical fold.
Other (5 papers)
【1】 Speeding Up Entmax
Link: https://arxiv.org/abs/2111.06832
Authors: Maxat Tezekbayev, Vassilina Nikoulina, Matthias Gallé, Zhenisbek Assylbekov
Affiliations: School of Sciences and Humanities, Nazarbayev University; NAVER Labs Europe
Note: 8 pages, 6 figures
Abstract: Softmax is the de facto standard in modern neural networks for language processing when it comes to normalizing logits. However, by producing a dense probability distribution, it gives each token in the vocabulary a nonzero chance of being selected at each generation step, leading to a variety of reported problems in text generation. The $\alpha$-entmax of arXiv:1905.05702 solves this problem, but is considerably slower than softmax. In this paper, we propose an alternative to $\alpha$-entmax which keeps its virtuous characteristics but is as fast as optimized softmax, and achieves on-par or better performance on the machine translation task.
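For concreteness, the $\alpha$-entmax family includes sparsemax ($\alpha = 2$), which has a simple closed form. The sketch below implements that special case in pure Python to show how an entmax-style mapping assigns exactly zero probability to low-scoring logits; it is only an illustration of the family, not the speedup proposed in the paper:

```python
def sparsemax(z):
    """Sparsemax (alpha-entmax with alpha = 2): a sparse alternative to softmax.

    Closed form of Martins & Astudillo (2016): sort the logits, find the
    support size k, derive a threshold tau, then shift and clip at zero.
    Tokens outside the support get exactly zero probability.
    """
    z_sorted = sorted(z, reverse=True)
    cumsums, total = [], 0.0
    for zj in z_sorted:
        total += zj
        cumsums.append(total)
    # Largest k such that 1 + k * z_(k) > sum of the top-k logits.
    k = max(j for j, (zj, sj) in enumerate(zip(z_sorted, cumsums), start=1)
            if 1 + j * zj > sj)
    tau = (cumsums[k - 1] - 1) / k
    return [max(zi - tau, 0.0) for zi in z]

print(sparsemax([1.0, 0.8, 0.1]))  # [0.6, 0.4, 0.0]: the weakest logit is pruned
```

Softmax over the same logits would keep all three probabilities nonzero, which is the generation-time behaviour the entmax line of work avoids.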
【2】 Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer in ASR
Link: https://arxiv.org/abs/2111.06799
Authors: Ondrej Klejch, Electra Wallington, Peter Bell
Affiliations: Centre for Speech Technology Research, University of Edinburgh, United Kingdom
Abstract: We present a method for cross-lingual training of an ASR system using absolutely no transcribed training data from the target language, and with no phonetic knowledge of the language in question. Our approach uses a novel application of a decipherment algorithm, which operates given only unpaired speech and text data from the target language. We apply this decipherment to phone sequences generated by a universal phone recogniser trained on out-of-language speech corpora, which we follow with flat-start semi-supervised training to obtain an acoustic model for the new language. To the best of our knowledge, this is the first practical approach to zero-resource cross-lingual ASR which does not rely on any hand-crafted phonetic information. We carry out experiments on read speech from the GlobalPhone corpus, and show that it is possible to learn a decipherment model on just 20 minutes of data from the target language. When used to generate pseudo-labels for semi-supervised training, we obtain WERs that range from 25% to just 5% absolute worse than the equivalent fully supervised models trained on the same data.
【3】 Variation and generality in encoding of syntactic anomaly information in sentence embeddings
Link: https://arxiv.org/abs/2111.06644
Authors: Qinxuan Wu, Allyson Ettinger
Affiliations: College of Computer Science and Technology, Zhejiang University; Department of Linguistics, The University of Chicago
Note: BlackboxNLP, EMNLP
Abstract: While sentence anomalies have been applied periodically for testing in NLP, we have yet to establish a picture of the precise status of anomaly information in representations from NLP models. In this paper we aim to fill two primary gaps, focusing on the domain of syntactic anomalies. First, we explore fine-grained differences in anomaly encoding by designing probing tasks that vary the hierarchical level at which anomalies occur in a sentence. Second, we test not only models' ability to detect a given anomaly, but also the generality of the detected anomaly signal, by examining transfer between distinct anomaly types. Results suggest that all models encode some information supporting anomaly detection, but detection performance varies between anomalies, and only representations from more recent transformer models show signs of generalized knowledge of anomalies. Follow-up analyses support the notion that these models pick up on a legitimate, general notion of sentence oddity, while coarser-grained word-position information is likely also a contributor to the observed anomaly detection.
【4】 PESTO: Switching Point based Dynamic and Relative Positional Encoding for Code-Mixed Languages
Link: https://arxiv.org/abs/2111.06599
Authors: Mohsin Ali, Kandukuri Sai Teja, Sumanth Manduru, Parth Patwa, Amitava Das
Affiliations: IIIT Sri City, India; UCLA, USA; Wipro AI Labs, India; AI Institute, University of South Carolina, USA
Note: Accepted as Student Abstract at AAAI 2022
Abstract: NLP applications for code-mixed (CM) or mixed-lingual text have gained significant momentum recently, the main reason being the prevalence of language mixing in social media communications in multilingual societies like India, Mexico, Europe, parts of the USA, etc. Word embeddings are basic building blocks of any NLP system today, yet word embedding for CM languages is an unexplored territory. The major bottleneck for CM word embeddings is switching points, where the language switches. These locations lack context, and statistical systems fail to model this phenomenon due to the high variance in the seen examples. In this paper we present our initial observations on applying switching-point-based positional encoding techniques to CM language, specifically Hinglish (Hindi-English). Results are only marginally better than SOTA, but it is evident that positional encoding could be an effective way to train position-sensitive language models for CM text.
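The abstract does not spell out the encoding scheme, but the core idea of switching-point-relative positions can be sketched as a token counter that restarts at each language switch. This is a toy illustration with assumed per-token language tags, not PESTO's actual method:

```python
def switch_relative_positions(langs):
    """Position indices that restart at each code-switching point.

    `langs` is a per-token language-tag sequence (e.g. for a Hinglish
    sentence). A standard absolute encoding numbers tokens 0..n-1; here
    the counter resets whenever the language changes, so each position is
    relative to the most recent switching point.
    """
    positions, pos = [], 0
    for i, lang in enumerate(langs):
        if i > 0 and lang != langs[i - 1]:
            pos = 0  # switching point: restart the position counter
        positions.append(pos)
        pos += 1
    return positions

# Tags for a hypothetical Hinglish sentence: hi en en en hi
print(switch_relative_positions(["hi", "en", "en", "en", "hi"]))  # [0, 0, 1, 2, 0]
```

These relative indices could then feed any standard (e.g. sinusoidal or learned) positional embedding table in place of absolute positions.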
【5】 Catalytic Role Of Noise And Necessity Of Inductive Biases In The Emergence Of Compositional Communication
Link: https://arxiv.org/abs/2111.06464
Authors: Łukasz Kuciński, Tomasz Korbak, Paweł Kołodziej, Piotr Miłoś
Affiliations: Polish Academy of Sciences; University of Sussex; University of Oxford; deepsense.ai
Note: NeurIPS 2021
Abstract: Communication is compositional if complex signals can be represented as a combination of simpler subparts. In this paper, we theoretically show that inductive biases on both the training framework and the data are needed to develop compositional communication. Moreover, we prove that compositionality spontaneously arises in signaling games, where agents communicate over a noisy channel. We experimentally confirm that a range of noise levels, which depends on the model and the data, indeed promotes compositionality. Finally, we provide a comprehensive study of this dependence and report results in terms of recently studied compositionality metrics: topographic similarity, conflict count, and context independence.