cs.CL: 23 papers today
Transformer (2 papers)
【1】 A Transformer-based Cross-modal Fusion Model with Adversarial Training for VQA Challenge 2021
Authors: Ke-Han Lu, Bo-Han Fang, Kuan-Yu Chen
Affiliations: Department of CSIE, NTUST, Taiwan. The two authors contributed equally to this paper.
Note: CVPR 2021 Workshop: Visual Question Answering (VQA) Challenge
Link: https://arxiv.org/abs/2106.13033
Abstract: In this paper, inspired by the successes of vision-language pre-trained models and the benefits from training with adversarial attacks, we present a novel transformer-based cross-modal fusion model that incorporates both notions for VQA Challenge 2021. Specifically, the proposed model is built on top of the architecture of the VinVL model [19], and the adversarial training strategy [4] is applied to make the model robust and generalized. Moreover, two implementation tricks are also used in our system to obtain better results. The experiments demonstrate that the novel framework can achieve 76.72% on the VQAv2 test-std set.
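As a rough illustration of the adversarial-training idea mentioned in this abstract, below is a minimal sketch of an FGM-style perturbation applied to input embeddings. It is a generic stand-in, not the specific strategy of [4] or the VinVL setup; the function names and the epsilon value are illustrative assumptions.

```python
# Minimal FGM-style adversarial training step on input embeddings.
# Generic stand-in for the adversarial strategy referenced above; `model`
# maps embeddings to logits, and all names/values here are illustrative.
import torch

def adversarial_training_step(model, embeds, labels, loss_fn, epsilon=1e-2):
    embeds = embeds.clone().detach().requires_grad_(True)
    clean_loss = loss_fn(model(embeds), labels)
    clean_loss.backward()                               # gradient w.r.t. the embeddings
    delta = epsilon * embeds.grad / (embeds.grad.norm() + 1e-12)
    adv_loss = loss_fn(model(embeds.detach() + delta), labels)
    # the caller backpropagates the returned loss and steps the optimizer,
    # so both the clean and the adversarial pass contribute gradients
    return clean_loss.detach() + adv_loss
```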
【2】 Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
Authors: Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler
Affiliations: Google Research and DeepMind
Link: https://arxiv.org/abs/2106.12672
Abstract: State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level. Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that Charformer outperforms a series of competitive byte-level baselines while generally performing on par with, and sometimes outperforming, subword-based models. Additionally, Charformer is fast, improving the speed of both vanilla byte-level and subword-level Transformers by 28%-100% while maintaining competitive quality. We believe this work paves the way for highly performant token-free models that are trained completely end-to-end.
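To make the GBST mechanism more concrete, here is a minimal sketch of the enumerate-pool-score-mix idea. Tensor sizes, block sizes, and the use of mean pooling are chosen for brevity and are assumptions, not Charformer's exact configuration.

```python
# Hedged sketch of GBST: form candidate subword blocks of several sizes by
# pooling character embeddings, score each candidate position-wise, and mix
# the candidates with a softmax over the scores.
import torch
import torch.nn.functional as F

B, L, d = 2, 16, 64                        # batch, character length, hidden size
char_emb = torch.randn(B, L, d)            # character/byte embeddings
score_net = torch.nn.Linear(d, 1)          # block scoring network
block_sizes = (1, 2, 4)

candidates, scores = [], []
for b in block_sizes:
    # mean-pool non-overlapping blocks of size b, then upsample back to length L
    pooled = F.avg_pool1d(char_emb.transpose(1, 2), kernel_size=b, stride=b)
    pooled = pooled.repeat_interleave(b, dim=2)[:, :, :L].transpose(1, 2)   # (B, L, d)
    candidates.append(pooled)
    scores.append(score_net(pooled))       # (B, L, 1): one score per position

cand = torch.stack(candidates, dim=2)                      # (B, L, num_blocks, d)
weights = F.softmax(torch.stack(scores, dim=2), dim=2)     # position-wise mixture
latent_subwords = (weights * cand).sum(dim=2)              # (B, L, d) soft subwords
print(latent_subwords.shape)
```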
BERT (1 paper)
【1】 Unsupervised Topic Segmentation of Meetings with BERT Embeddings
Authors: Alessandro Solbiati, Kevin Heffernan, Georgios Damaskinos, Shivani Poddar, Shubham Modi, Jacques Cali
Affiliations: Facebook, Inc.
Link: https://arxiv.org/abs/2106.12978
Abstract: Topic segmentation of meetings is the task of dividing multi-person meeting transcripts into topic blocks. Supervised approaches to the problem have proven intractable due to the difficulties in collecting and accurately annotating large datasets. In this paper we show how previous unsupervised topic segmentation methods can be improved using pre-trained neural architectures. We introduce an unsupervised approach based on BERT embeddings that achieves a 15.5% reduction in error rate over existing unsupervised approaches applied to two popular datasets for meeting transcripts.
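A minimal sketch of how BERT embeddings can drive unsupervised topic segmentation is given below: utterances are embedded with a generic BERT checkpoint and boundaries are placed where similarity between adjacent utterances dips. The checkpoint, mean pooling, and depth-score threshold are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: TextTiling-style topic segmentation driven by BERT embeddings.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(utterances):
    """Mean-pool the last hidden layer to get one vector per utterance."""
    batch = tokenizer(utterances, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)        # (B, T, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

def segment(utterances, threshold=0.1):
    """Place a topic boundary where adjacent-utterance similarity dips
    below its neighbours by more than `threshold` (a toy depth score)."""
    vecs = embed(utterances)
    sims = [
        float(np.dot(vecs[i], vecs[i + 1])
              / (np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[i + 1])))
        for i in range(len(vecs) - 1)
    ]
    boundaries = []
    for i in range(1, len(sims) - 1):
        depth = (sims[i - 1] - sims[i]) + (sims[i + 1] - sims[i])
        if depth > threshold:
            boundaries.append(i + 1)    # boundary before utterance i+1
    return boundaries
```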
QA|VQA|Question Answering|Dialogue (1 paper)
【1】 AIT-QA: Question Answering Dataset over Complex Tables in the Airline Industry
Authors: Yannis Katsis, Saneem Chemmengath, Vishwajeet Kumar, Samarth Bharadwaj, Mustafa Canim, Michael Glass, Alfio Gliozzo, Feifei Pan, Jaydeep Sen, Karthik Sankaranarayanan, Soumen Chakrabarti
Affiliations: IBM Research, Rensselaer Polytechnic Institute, IIT Bombay
Link: https://arxiv.org/abs/2106.12944
Abstract: Recent advances in transformers have enabled Table Question Answering (Table QA) systems to achieve high accuracy and SOTA results on open domain datasets like WikiTableQuestions and WikiSQL. Such transformers are frequently pre-trained on open-domain content such as Wikipedia, where they effectively encode questions and corresponding tables from Wikipedia as seen in Table QA datasets. However, web tables in Wikipedia are notably flat in their layout, with the first row as the sole column header. The layout lends itself to a relational view of tables where each row is a tuple. In contrast, tables in domain-specific business or scientific documents often have a much more complex layout, including hierarchical row and column headers, in addition to having specialized vocabulary terms from that domain. To address this problem, we introduce the domain-specific Table QA dataset AIT-QA (Airline Industry Table QA). The dataset consists of 515 questions authored by human annotators on 116 tables extracted from public U.S. SEC filings (publicly available at: https://www.sec.gov/edgar.shtml) of major airline companies for the fiscal years 2017-2019. We also provide annotations pertaining to the nature of questions, marking those that require hierarchical headers, domain-specific terminology, and paraphrased forms. Our zero-shot baseline evaluation of three transformer-based SOTA Table QA methods - TaPAS (end-to-end), TaBERT (semantic parsing-based), and RCI (row-column encoding-based) - clearly exposes the limitation of these methods in this practical setting, with the best accuracy at just 51.8% (RCI). We also present pragmatic table preprocessing steps used to pivot and project these complex tables into a layout suitable for the SOTA Table QA models.
Machine Translation (1 paper)
【1】 On the Influence of Machine Translation on Language Origin Obfuscation
Authors: Benjamin Murauer, Michael Tschuggnall, Günther Specht
Affiliations: Universität Innsbruck
Note: This was peer-reviewed, accepted and presented at this https URL, but the organizer somehow failed to publish the proceedings
Link: https://arxiv.org/abs/2106.12830
Abstract: In the last decade, machine translation has become a popular means to deal with multilingual digital content. By providing higher quality translations, obfuscating the source language of a text becomes more attractive. In this paper, we analyze the ability to detect the source language from the translated output of two widely used commercial machine translation systems by utilizing machine-learning algorithms with basic textual features like n-grams. Evaluations show that the source language can be reconstructed with high accuracy for documents that contain a sufficient amount of translated text. In addition, we analyze how the document size influences the performance of the prediction, as well as how limiting the set of possible source languages improves the classification accuracy.
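The "machine-learning algorithms with basic textual features like n-grams" mentioned above can be approximated with a standard character n-gram pipeline. The sketch below uses scikit-learn, toy data, and a logistic-regression classifier as illustrative choices; the paper does not necessarily use these exact components.

```python
# Hedged sketch: detect the original source language of machine-translated
# English text using character n-gram features and a linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# (translated English text, original source language) -- toy examples only
train_texts = [
    "this sentence was translated from a german press release",
    "the following text comes from a french newspaper article",
    "another document originally written in german about trains",
    "a short french blog post about cooking, rendered in english",
]
train_labels = ["de", "fr", "de", "fr"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),  # character n-grams
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)
print(clf.predict(["a new translated document whose origin we want to guess"]))
```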
Semantic Analysis (3 papers)
【1】 Where are we in semantic concept extraction for Spoken Language Understanding?
Authors: Sahar Ghannay, Antoine Caubrière, Salima Mdhaffar, Gaëlle Laperrière, Bassam Jabaian, Yannick Estève
Affiliations: Université Paris-Saclay, CNRS, LISN, Orsay, France; LIA - Avignon Université, France
Note: Submitted to the SPECOM 2021 conference
Link: https://arxiv.org/abs/2106.13045
Abstract: The spoken language understanding (SLU) topic has seen a lot of progress over the last three years, with the emergence of end-to-end neural approaches. Spoken language understanding refers to natural language processing tasks related to semantic extraction from the speech signal, such as named entity recognition from speech or slot filling in the context of human-machine dialogue. Classically, SLU tasks were processed through a cascade approach that consists of applying, first, an automatic speech recognition process, followed by a natural language processing module applied to the automatic transcriptions. Over the last three years, end-to-end neural approaches based on deep neural networks have been proposed in order to directly extract the semantics from the speech signal, using a single neural model. More recent work on self-supervised training with unlabeled data opens new perspectives in terms of performance for automatic speech recognition and natural language processing. In this paper, we present a brief overview of the recent advances on the French MEDIA benchmark dataset for SLU, with or without the use of additional data. We also present our latest results, which significantly outperform the current state of the art with a Concept Error Rate (CER) of 11.2%, compared to 13.6% for the last state-of-the-art system presented this year.
【2】 A comprehensive empirical analysis on cross-domain semantic enrichment for detection of depressive language
Authors: Nawshad Farruque, Randy Goebel, Osmar Zaiane
Affiliations: Department of Computing Science, University of Alberta, Alberta, Canada
Note: This is an extension of the ECML-PKDD 2019 paper "Augmenting Semantic Representation of Depressive Language: from Forums to Microblogs", with more embedding mapping/augmentation methods and data ablation tests. These experiments were done in 2019.
Link: https://arxiv.org/abs/2106.12797
Abstract: We analyze the process of creating word embedding feature representations designed for a learning task when annotated data is scarce, for example, in depressive language detection from Tweets. We start with a rich word embedding pre-trained from a large general dataset, which is then augmented with embeddings learned from a much smaller and more specific domain dataset through a simple non-linear mapping mechanism. We also experimented with several other more sophisticated methods of such mapping, including several auto-encoder-based and custom loss-function-based methods that learn embedding representations by gradually learning to be close to words of similar semantics and distant from those of dissimilar semantics. Our strengthened representations better capture the semantics of the depression domain, as they combine the semantics learned from the specific domain with word coverage from the general language. We also present a comparative performance analysis of our word embedding representations against a simple bag-of-words model, well-known sentiment and psycholinguistic lexicons, and a general pre-trained word embedding. When used as feature representations for several different machine learning methods, including deep learning models, in a depressive Tweets identification task, we show that our augmented word embedding representations achieve a significantly better F1 score than the others, especially when applied to a high quality dataset. We also present several data ablation tests which confirm the efficacy of our augmentation techniques.
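A minimal sketch of the "simple non-linear mapping mechanism" could look like the following: a small network learns to map general pre-trained vectors onto domain-specific vectors over the shared vocabulary, and is then applied to words missing from the domain corpus. Dimensions, architecture, and training details are assumptions, and the random tensors stand in for real embedding matrices.

```python
# Hedged sketch: learn a non-linear mapping from a general embedding space to
# a domain-specific embedding space, then use it to augment representations
# of words that are absent from the (small) domain corpus.
import torch
import torch.nn as nn

general_dim, domain_dim = 300, 100

mapper = nn.Sequential(
    nn.Linear(general_dim, 256),
    nn.Tanh(),
    nn.Linear(256, domain_dim),
)
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# x: general embeddings, y: domain embeddings for words present in both
# vocabularies (random placeholders here, real vectors in practice)
x = torch.randn(5000, general_dim)
y = torch.randn(5000, domain_dim)

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(mapper(x), y)
    loss.backward()
    optimizer.step()

# For a word outside the domain vocabulary, concatenate its general vector
# with mapper(general_vector) to obtain the augmented representation.
```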
【3】 Dealing with training and test segmentation mismatch: FBK@IWSLT2021
Authors: Sara Papi, Marco Gaido, Matteo Negri, Marco Turchi
Affiliations: Fondazione Bruno Kessler, Trento, Italy; University of Trento, Italy
Note: Accepted at IWSLT 2021
Link: https://arxiv.org/abs/2106.12607
Abstract: This paper describes FBK's system submission to the IWSLT 2021 Offline Speech Translation task. We participated with a direct model, which is a Transformer-based architecture trained to translate English speech audio data into German texts. The training pipeline is characterized by knowledge distillation and a two-step fine-tuning procedure. Both knowledge distillation and the first fine-tuning step are carried out on manually segmented real and synthetic data, the latter being generated with an MT system trained on the available corpora. In contrast, the second fine-tuning step is carried out on a random segmentation of the MuST-C v2 En-De dataset. Its main goal is to reduce the performance drops occurring when a speech translation model trained on manually segmented data (i.e. an ideal, sentence-like segmentation) is evaluated on automatically segmented audio (i.e. actual, more realistic testing conditions). For the same purpose, a custom hybrid segmentation procedure that accounts for both audio content (pauses) and for the length of the produced segments is applied to the test data before passing them to the system. At inference time, we compared this procedure with a baseline segmentation method based on Voice Activity Detection (VAD). Our results indicate the effectiveness of the proposed hybrid approach, shown by a reduction of the gap with manual segmentation from 8.3 to 1.4 BLEU points.
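Below is a hedged sketch of a hybrid segmentation heuristic in the spirit described above: cut the audio at detected pauses, but also force a cut when a segment grows too long. The inputs are (start, end) times of voiced intervals (e.g. from a VAD-like detector), and the thresholds are illustrative, not FBK's actual values.

```python
# Hedged sketch: hybrid audio segmentation based on pauses and segment length.
def hybrid_segment(voiced, pause_threshold=0.5, max_len=20.0):
    """voiced: list of (start_sec, end_sec) of consecutive voiced intervals."""
    segments, seg_start, prev_end = [], voiced[0][0], voiced[0][1]
    for start, end in voiced[1:]:
        long_enough_pause = (start - prev_end) >= pause_threshold
        too_long = (end - seg_start) > max_len
        if long_enough_pause or too_long:     # cut on a pause or when too long
            segments.append((seg_start, prev_end))
            seg_start = start
        prev_end = end
    segments.append((seg_start, prev_end))
    return segments

print(hybrid_segment([(0.0, 1.2), (1.4, 3.0), (3.9, 5.0), (5.1, 28.0), (28.2, 29.5)]))
```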
Graph|Knowledge Graph|Knowledge (4 papers)
【1】 Splitting EUD graphs into trees: A quick and clatty approach
Authors: Mark Anderson, Carlos Gómez-Rodríguez
Affiliations: Universidade da Coruña, CITIC, FASTPARSE Lab, LyS Research Group, Departamento de Ciencias de la Computación y Tecnologías de la Información
Note: IWPT 2021 Shared Task system description. To be published in the proceedings of the 17th International Conference on Parsing Technologies
Link: https://arxiv.org/abs/2106.13155
Abstract: We present the system submission from the FASTPARSE team for the EUD Shared Task at IWPT 2021. We engaged in the task last year by focusing on efficiency. This year we have focused on experimenting with new ideas on a limited time budget. Our system is based on splitting the EUD graph into several trees, based on linguistic criteria. We predict these trees using a sequence-labelling parser and combine them into an EUD graph. The results were relatively poor, although not a total disaster, and could probably be improved with some polishing of the system's rough edges.
【2】 OKGIT: Open Knowledge Graph Link Prediction with Implicit Types
Authors: Chandrahas, Partha Pratim Talukdar
Affiliations: Indian Institute of Science, Bangalore
Note: Findings of the ACL: ACL-IJCNLP 2021
Link: https://arxiv.org/abs/2106.12806
Abstract: Open Knowledge Graphs (OpenKG) refer to a set of (head noun phrase, relation phrase, tail noun phrase) triples such as (tesla, return to, new york) extracted from a corpus using OpenIE tools. While OpenKGs are easy to bootstrap for a domain, they are very sparse and far from being directly usable in an end task. Therefore, the task of predicting new facts, i.e., link prediction, becomes an important step while using these graphs in downstream tasks such as text comprehension, question answering, and web search query recommendation. Learning embeddings for OpenKGs is one approach for link prediction that has received some attention lately. However, on careful examination, we found that current OpenKG link prediction algorithms often predict noun phrases (NPs) with incompatible types for given noun and relation phrases. We address this problem in this work and propose OKGIT, which improves OpenKG link prediction using a novel type compatibility score and type regularization. With extensive experiments on multiple datasets, we show that the proposed method achieves state-of-the-art performance while producing type-compatible NPs in the link prediction task.
【3】 An Automated Knowledge Mining and Document Classification System with Multi-model Transfer Learning
Authors: Jia Wei Chong, Zhiyuan Chen, Mei Shin Oh
Affiliations: School of Computer Science, University of Nottingham Malaysia, Jalan Broga, Semenyih, Selangor, Malaysia; CAD-IT Consultants (M), Jalan SS, Kelana Jaya, Petaling Jaya, Selangor, Malaysia
Note: This paper has been submitted to the Journal of System and Management Sciences
Link: https://arxiv.org/abs/2106.12744
Abstract: Service manual documents are crucial to an engineering company as they provide guidelines and knowledge to service engineers. However, it has become inconvenient and inefficient for service engineers to retrieve specific knowledge from documents due to the complexity of resources. In this research, we propose an automated knowledge mining and document classification system with novel multi-model transfer learning approaches. In particular, the classification performance of the system has been improved with three effective techniques: fine-tuning, pruning, and a multi-model method. The fine-tuning technique optimizes a pre-trained BERT model by adding a feed-forward neural network layer, and the pruning technique is used to retrain the BERT model with new data. The multi-model method initializes and trains multiple BERT models to overcome the randomness of data ordering during the fine-tuning process. In the first iteration of the training process, multiple BERT models are trained simultaneously. The best model is then selected for the next phase of the training process, which runs for another two iterations, while the training processes for the other BERT models are terminated. The performance of the proposed system has been evaluated by comparison with two robust baseline methods, BERT and BERT-CNN. Experimental results on the widely used Corpus of Linguistic Acceptability (CoLA) dataset have shown that the proposed techniques perform better than these baseline methods in terms of accuracy and MCC score.
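The multi-model selection step described above can be sketched as follows. A tiny linear classifier on toy data stands in for the fine-tuned BERT models so the example stays self-contained; the seeds, epoch counts, and accuracy metric are illustrative assumptions, not the paper's setup.

```python
# Hedged sketch: train several copies of a classifier under different random
# seeds (hence different data ordering), keep only the best on a validation
# split, and continue training just that one.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(200, 32); y = (X[:, 0] > 0).long()          # toy data
Xtr, ytr, Xval, yval = X[:150], y[:150], X[150:], y[150:]

def train(model, epochs):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(epochs):
        perm = torch.randperm(len(Xtr))                       # ordering differs per seed
        for idx in perm.split(16):
            opt.zero_grad()
            nn.functional.cross_entropy(model(Xtr[idx]), ytr[idx]).backward()
            opt.step()

def accuracy(model):
    return (model(Xval).argmax(1) == yval).float().mean().item()

candidates = []
for seed in (13, 42, 2021):
    torch.manual_seed(seed)
    model = nn.Linear(32, 2)                                  # stand-in for a BERT classifier
    train(model, epochs=1)                                    # first iteration: all models
    candidates.append((accuracy(model), model))

best_acc, best = max(candidates, key=lambda p: p[0])
train(best, epochs=2)                                         # remaining iterations: best model only
print("selected model accuracy:", accuracy(best))
```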
【4】 Discovering novel drug-supplement interactions using a dietary supplements knowledge graph generated from the biomedical literature
Authors: Dalton Schutte, Jake Vasilakes, Anu Bompelli, Yuqi Zhou, Marcelo Fiszman, Hua Xu, Halil Kilicoglu, Jeffrey R. Bishop, Terrence Adam, Rui Zhang
Affiliations: Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA; Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis
Note: 14 pages, 4 tables, 1 figure
Link: https://arxiv.org/abs/2106.12741
Abstract: OBJECTIVE: Leverage existing biomedical NLP tools and DS domain terminology to produce a novel and comprehensive knowledge graph containing dietary supplement (DS) information for discovering interactions between DS and drugs, or Drug-Supplement Interactions (DSI). MATERIALS AND METHODS: We created SemRepDS (an extension of SemRep), capable of extracting semantic relations from abstracts by leveraging a DS-specific terminology (iDISK) containing 28,884 DS terms not found in the UMLS. PubMed abstracts were processed using SemRepDS to generate semantic relations, which were then filtered using a PubMedBERT-based model to remove incorrect relations before generating our knowledge graph (SuppKG). Two pathways are used to identify potential DS-Drug interactions, which are then evaluated by medical professionals for mechanistic plausibility. RESULTS: Comparison analysis found that SemRepDS returned 206.9% more DS relations and 158.5% more DS entities than SemRep. The fine-tuned BERT model obtained an F1 score of 0.8605 and removed 43.86% of the relations, improving the precision of the relations by 26.4% compared to pre-filtering. SuppKG consists of 2,928 DS-specific nodes. Manual review of findings identified 44 (88%) proposed DS-Gene-Drug and 32 (64%) proposed DS-Gene1-Function-Gene2-Drug pathways to be mechanistically plausible. DISCUSSION: The additional relations extracted using SemRepDS generated SuppKG, which was used to find plausible DSI not found in the current literature. By the nature of the SuppKG, these interactions are unlikely to have been found using SemRep without the expanded DS terminology. CONCLUSION: We successfully extend SemRep to include DS information and produce SuppKG, which can be used to find potential DS-Drug interactions.
Reasoning|Analysis|Understanding|Explanation (2 papers)
【1】 Towards Understanding and Mitigating Social Biases in Language Models
Authors: Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, Ruslan Salakhutdinov
Affiliations: Carnegie Mellon University
Note: ICML 2021, code available at this https URL
Link: https://arxiv.org/abs/2106.13219
Abstract: As machine learning methods are deployed in real-world settings such as healthcare, legal systems, and social science, it is crucial to recognize how they shape social biases and stereotypes in these sensitive decision-making processes. Among such real-world deployments are large-scale pretrained language models (LMs) that can be potentially dangerous in manifesting undesirable representational biases - harmful biases resulting from stereotyping that propagate negative generalizations involving gender, race, religion, and other social constructs. As a step towards improving the fairness of LMs, we carefully define several sources of representational biases before proposing new benchmarks and metrics to measure them. With these tools, we propose steps towards mitigating social biases during text generation. Our empirical results and human evaluation demonstrate effectiveness in mitigating bias while retaining crucial contextual information for high-fidelity text generation, thereby pushing forward the performance-fairness Pareto frontier.
【2】 Comparative Error Analysis in Neural and Finite-state Models for Unsupervised Character-level Transduction
Authors: Maria Ryskina, Eduard Hovy, Taylor Berg-Kirkpatrick, Matthew R. Gormley
Affiliations: Language Technologies Institute, Carnegie Mellon University; Computer Science and Engineering, University of California, San Diego; Machine Learning Department, Carnegie Mellon University
Note: Accepted to SIGMORPHON 2021
Link: https://arxiv.org/abs/2106.12698
Abstract: Traditionally, character-level transduction problems have been solved with finite-state models designed to encode structural and linguistic knowledge of the underlying process, whereas recent approaches rely on the power and flexibility of sequence-to-sequence models with attention. Focusing on the less explored unsupervised learning scenario, we compare the two model classes side by side and find that they tend to make different types of errors even when achieving comparable performance. We analyze the distributions of different error classes using two unsupervised tasks as testbeds: converting informally romanized text into the native script of its language (for Russian, Arabic, and Kannada) and translating between a pair of closely related languages (Serbian and Bosnian). Finally, we investigate how combining finite-state and sequence-to-sequence models at decoding time affects the output quantitatively and qualitatively.
Recognition/Classification (2 papers)
【1】 Evaluation of Representation Models for Text Classification with AutoML Tools
Authors: Sebastian Brändle, Marc Hanussek, Matthias Blohm, Maximilien Kintz
Affiliations: University of Stuttgart IAT, Institute of Human Factors and Technology Management, Stuttgart, Germany
Note: Accepted for the Future Technologies Conference 2021
Link: https://arxiv.org/abs/2106.12798
Abstract: Automated Machine Learning (AutoML) has gained increasing success on tabular data in recent years. However, processing unstructured data like text is a challenge and not widely supported by open-source AutoML tools. This work compares three manually created text representations and text embeddings automatically created by AutoML tools. Our benchmark includes four popular open-source AutoML tools and eight datasets for text classification purposes. The results show that straightforward text representations perform better than AutoML tools with automatically created text embeddings.
【2】 Clinical Named Entity Recognition using Contextualized Token Representations
Authors: Yichao Zhou, Chelsea Ju, J. Harry Caufield, Kevin Shih, Calvin Chen, Yizhou Sun, Kai-Wei Chang, Peipei Ping, Wei Wang
Affiliations: University of California, Los Angeles
Note: 1 figure, 6 tables
Link: https://arxiv.org/abs/2106.12608
Abstract: The clinical named entity recognition (CNER) task seeks to locate and classify clinical terminologies into predefined categories, such as diagnostic procedure, disease disorder, severity, medication, medication dosage, and sign symptom. CNER facilitates the study of side-effects of medications, including the identification of novel phenomena and human-focused information extraction. Existing approaches to extracting the entities of interest focus on using static word embeddings to represent each word. However, one word can have different interpretations that depend on the context of the sentences. Evidently, static word embeddings are insufficient to integrate the diverse interpretations of a word. To overcome this challenge, the technique of contextualized word embedding has been introduced to better capture the semantic meaning of each word based on its context. Two of these language models, ELMo and Flair, have been widely used in the field of Natural Language Processing to generate contextualized word embeddings on domain-generic documents. However, these embeddings are usually too general to capture the proximity among vocabularies of specific domains. To facilitate various downstream applications using clinical case reports (CCRs), we pre-train two deep contextualized language models, Clinical Embeddings from Language Model (C-ELMo) and Clinical Contextual String Embeddings (C-Flair), using the clinical-related corpus from PubMed Central. Experiments show that our models gain dramatic improvements compared to both static word embeddings and domain-generic language models.
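For readers unfamiliar with contextualized string embeddings, the sketch below obtains per-token vectors with the open-source Flair library using generic English checkpoints. It stands in for the clinical C-Flair model described above (whose weights are not assumed to be available here), and the example sentence is invented.

```python
# Hedged sketch: contextualized token representations with the Flair library,
# using generic (non-clinical) forward/backward character language models.
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings

embeddings = StackedEmbeddings([
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
])

sentence = Sentence("Patient was prescribed 5 mg of warfarin daily .")
embeddings.embed(sentence)

for token in sentence:
    vec = token.embedding          # context-dependent vector for this token
    print(token.text, vec.shape)
```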
Corpora (1 paper)
【1】 QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus
Authors: Hamdy Mubarak, Amir Hussein, Shammur Absar Chowdhury, Ahmed Ali
Affiliations: Qatar Computing Research Institute, HBKU, Doha, Qatar; Kanari AI, California, USA
Note: Keywords: speech corpus, spoken conversation, ASR, dialect identification, punctuation restoration, speaker verification, NER, named entity, Arabic, speaker gender, turn-taking. Accepted at ACL 2021
Link: https://arxiv.org/abs/2106.13000
Abstract: We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain. This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16kHz crawled from the Aljazeera news channel. The dataset is released with lightly supervised transcriptions, aligned with the audio segments. Unlike previous datasets, QASR contains linguistically motivated segmentation, punctuation, and speaker information, among others. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics-based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data. In addition to the QASR transcription, we release a dataset of 130M words to aid in designing and training a better language model. We show that end-to-end automatic speech recognition trained on QASR reports a competitive word error rate compared to the previous MGB-2 corpus. We report baseline results for downstream natural language processing tasks such as named entity recognition using speech transcripts. We also report the first baseline for Arabic punctuation restoration. We make the corpus available for the research community.
Word2Vec|Text|Words (1 paper)
【1】 Multilingual transfer of acoustic word embeddings improves when training on languages related to the target zero-resource language
Authors: Christiaan Jacobs, Herman Kamper
Affiliations: E&E Engineering, Stellenbosch University
Note: Accepted to Interspeech 2021
Link: https://arxiv.org/abs/2106.12834
Abstract: Acoustic word embedding models map variable duration speech segments to fixed dimensional vectors, enabling efficient speech search and discovery. Previous work explored how embeddings can be obtained in zero-resource settings where no labelled data is available in the target language. The current best approach uses transfer learning: a single supervised multilingual model is trained using labelled data from multiple well-resourced languages and then applied to a target zero-resource language (without fine-tuning). However, it is still unclear how the specific choice of training languages affects downstream performance. Concretely, here we ask whether it is beneficial to use training languages related to the target. Using data from eleven languages spoken in Southern Africa, we experiment with adding data from different language families while controlling for the amount of data per language. In word discrimination and query-by-example search evaluations, we show that training on languages from the same family gives large improvements. Through finer-grained analysis, we show that training on even just a single related language gives the largest gain. We also find that adding data from unrelated languages generally doesn't hurt performance.
Other Neural Networks|Deep Learning|Models|Modeling (1 paper)
【1】 Learning Language and Multimodal Privacy-Preserving Markers of Mood from Mobile Data
Authors: Paul Pu Liang, Terrance Liu, Anna Cai, Michal Muszynski, Ryo Ishii, Nicholas Allen, Randy Auerbach, David Brent, Ruslan Salakhutdinov, Louis-Philippe Morency
Affiliations: Carnegie Mellon University, University of Oregon, Columbia University, University of Pittsburgh
Note: ACL 2021. arXiv admin note: substantial text overlap with arXiv:2012.02359
Link: https://arxiv.org/abs/2106.13213
Abstract: Mental health conditions remain underdiagnosed even in countries with common access to advanced medical care. The ability to accurately and efficiently predict mood from easily collectible data has several important implications for the early detection, intervention, and treatment of mental health disorders. One promising data source to help monitor human behavior is daily smartphone usage. However, care must be taken to summarize behaviors without identifying the user through personal (e.g., personally identifiable information) or protected (e.g., race, gender) attributes. In this paper, we study behavioral markers of daily mood using a recent dataset of mobile behaviors from adolescent populations at high risk of suicidal behaviors. Using computational models, we find that language and multimodal representations of mobile typed text (spanning typed characters, words, keystroke timings, and app usage) are predictive of daily mood. However, we find that models trained to predict mood often also capture private user identities in their intermediate representations. To tackle this problem, we evaluate approaches that obfuscate user identity while remaining predictive. By combining multimodal representations with privacy-preserving learning, we are able to push forward the performance-privacy frontier.
Other (4 papers)
【1】 Exploring Self-Identified Counseling Expertise in Online Support Forums
Authors: Allison Lahnala, Yuntian Zhao, Charles Welch, Jonathan K. Kummerfeld, Lawrence An, Kenneth Resnicow, Rada Mihalcea, Verónica Pérez-Rosas
Affiliations: Computer Science & Engineering, University of Michigan; Medical School, University of Michigan; School of Public Health, University of Michigan; Center for Health Communications Research, University of Michigan
Note: Accepted to Findings of ACL 2021
Link: https://arxiv.org/abs/2106.12976
Abstract: A growing number of people engage in online health forums, making it important to understand the quality of the advice they receive. In this paper, we explore the role of expertise in responses provided to help-seeking posts regarding mental health. We study the differences between (1) interactions with peers; and (2) interactions with self-identified mental health professionals. First, we show that a classifier can distinguish between these two groups, indicating that their language use does in fact differ. To understand this difference, we perform several analyses addressing engagement aspects, including whether their comments engage the support-seeker further, as well as linguistic aspects, such as dominant language and linguistic style matching. Our work contributes toward the developing efforts of understanding how health experts engage with health information- and support-seekers in social networks. More broadly, it is a step toward a deeper understanding of the styles of interactions that cultivate supportive engagement in online communities.
【2】 Modeling Diagnostic Label Correlation for Automatic ICD Coding
Authors: Shang-Chi Tsai, Chao-Wei Huang, Yun-Nung Chen
Affiliations: National Taiwan University, Taipei, Taiwan
Note: NAACL 2021 Long Paper. Code available at this https URL
Link: https://arxiv.org/abs/2106.12800
Abstract: Given the clinical notes written in electronic health records (EHRs), predicting the diagnostic codes is formulated as a challenging multi-label classification task. The large set of labels, the hierarchical dependency, and the imbalanced data make this prediction task extremely hard. Most existing work builds a binary prediction for each label independently, ignoring the dependencies between labels. To address this problem, we propose a two-stage framework to improve automatic ICD coding by capturing the label correlation. Specifically, we train a label set distribution estimator to rescore the probability of each label set candidate generated by a base predictor. This paper is the first attempt at learning the label set distribution as a reranking module for medical code prediction. In the experiments, our proposed framework is able to improve upon best-performing predictors on the benchmark MIMIC datasets. The source code of this project is available at https://github.com/MiuLab/ICD-Correlation.
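The two-stage reranking idea can be illustrated as follows, with a hand-written label co-occurrence prior standing in for the learned label set distribution estimator. The codes, probabilities, candidate generation, and scoring function are toy assumptions, not the paper's method or data.

```python
# Hedged sketch of two-stage reranking: a base predictor outputs independent
# per-label probabilities, candidate label sets are enumerated from the top
# labels, and a simple co-occurrence prior rescores each candidate set.
import numpy as np
from itertools import combinations

labels = ["401.9", "428.0", "427.31", "584.9"]        # toy ICD codes
base_probs = np.array([0.85, 0.60, 0.40, 0.30])        # from the base predictor
cooc = np.array([[1.0, 0.7, 0.5, 0.2],                 # toy label co-occurrence prior
                 [0.7, 1.0, 0.6, 0.3],                 # (would be estimated from data)
                 [0.5, 0.6, 1.0, 0.2],
                 [0.2, 0.3, 0.2, 1.0]])

def candidates(k_max=3):
    """Enumerate small label sets drawn from the highest-probability labels."""
    order = np.argsort(-base_probs)
    for k in range(1, k_max + 1):
        for subset in combinations(order[:k_max], k):
            yield subset

def score(subset):
    """Combine base log-probabilities with an average pairwise co-occurrence."""
    p = np.log(base_probs[list(subset)]).sum()
    pair = np.mean([cooc[i, j] for i, j in combinations(subset, 2)]) if len(subset) > 1 else 1.0
    return p + np.log(pair)

best = max(candidates(), key=score)
print([labels[i] for i in best])
```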
【3】 TagRuler: Interactive Tool for Span-Level Data Programming by Demonstration
Authors: Dongjin Choi, Sara Evensen, Çağatay Demiralp, Estevam Hruschka
Affiliations: Georgia Institute of Technology, USA; Megagon Labs, USA; Sigma Computing, USA
Note: WWW'21 Demo
Link: https://arxiv.org/abs/2106.12767
Abstract: Despite rapid developments in the field of machine learning research, collecting high-quality labels for supervised learning remains a bottleneck for many applications. This difficulty is exacerbated by the fact that state-of-the-art models for NLP tasks are becoming deeper and more complex, often increasing the amount of training data required even for fine-tuning. Weak supervision methods, including data programming, address this problem and reduce the cost of label collection by using noisy label sources for supervision. However, until recently, data programming was only accessible to users who knew how to program. To bridge this gap, the Data Programming by Demonstration framework was proposed to facilitate the automatic creation of labeling functions based on a few examples labeled by a domain expert. This framework has proven successful for generating high-accuracy labeling models for document classification. In this work, we extend the DPBD framework to span-level annotation tasks, arguably one of the most time-consuming NLP labeling tasks. We built a novel tool, TagRuler, that makes it easy for annotators to build span-level labeling functions without programming and encourages them to explore trade-offs between different labeling models and active learning strategies. We empirically demonstrated that an annotator could achieve a higher F1 score using the proposed tool compared to manual labeling for different span-level annotation tasks.
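To give a feel for what a span-level labeling function is, here is a minimal sketch of the kind of rule a TagRuler user might effectively build without programming: it tags character spans matching a simple pattern. The label name and pattern are illustrative assumptions, not part of the tool.

```python
# Hedged sketch: a span-level labeling function that emits (start, end, label)
# triples for character spans that look like prices.
import re

def lf_price_span(text):
    return [(m.start(), m.end(), "PRICE")
            for m in re.finditer(r"\$\d+(?:\.\d{2})?", text)]

print(lf_price_span("The room costs $120.50 per night, breakfast is $15."))
```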
【4】 Bidding via Clustering Ads Intentions: an Efficient Search Engine Marketing System for E-commerce
Authors: Cheng Jie, Da Xu, Zigeng Wang, Lu Wang, Wei Shen
Affiliations: Walmart Labs, Sunnyvale, California, USA
Link: https://arxiv.org/abs/2106.12700
Abstract: With the increasing scale of search engine marketing, designing an efficient bidding system is becoming paramount for the success of e-commerce companies. The critical challenges faced by a modern industrial-level bidding system include: 1. the catalog is enormous, and the relevant bidding features are of high sparsity; 2. the large volume of bidding requests induces a significant computation burden on both offline and online serving. Leveraging extraneous user-item information proves essential to mitigate the sparsity issue, for which we exploit the natural language signals from the users' queries and the contextual knowledge from the products. In particular, we extract vector representations of ads via a Transformer model and leverage their geometric relations to build collaborative bidding predictions via clustering. The two-step procedure also significantly reduces the computation stress of bid evaluation and optimization. In this paper, we introduce the end-to-end structure of the bidding system for search engine marketing for Walmart e-commerce, which successfully handles tens of millions of bids each day. We analyze the online and offline performance of our approach and discuss how we find it to be a production-efficient solution.
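As a rough illustration of clustering ad intentions from text embeddings, the sketch below encodes a few toy ad strings with an off-the-shelf Transformer encoder and groups them with k-means. The checkpoint, library, and cluster count are illustrative assumptions rather than the production system's components.

```python
# Hedged sketch: embed ad/query texts with a Transformer encoder and cluster
# the vectors, so that bid features can be shared within each cluster.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

ads = ["wireless earbuds bluetooth", "noise cancelling headphones",
       "kids school backpack", "laptop backpack waterproof"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = encoder.encode(ads, normalize_embeddings=True)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for ad, cluster in zip(ads, kmeans.labels_):
    print(cluster, ad)    # bids can now be predicted per cluster instead of per ad
```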