cs.LG: 46 papers today
Graph-related (graph learning | graph neural networks | graph optimization, etc.) (3 papers)
【1】 Computing Graph Descriptors on Edge Streams Link: https://arxiv.org/abs/2109.01494
Authors: Zohair Raza Hassan, Imdadullah Khan, Mudassir Shabbir, Waseem Abbas
Affiliations: Rochester Institute of Technology, USA; Lahore University of Management Sciences, Pakistan; Information Technology University, Pakistan; The University of Texas at Dallas, USA
Comments: Extension of work accepted to PAKDD 2020
Abstract: Graph feature extraction is a fundamental task in graph analytics. Using feature vectors (graph descriptors) in tandem with data-mining algorithms that operate on Euclidean data, one can solve problems such as classification, clustering, and anomaly detection on graph-structured data. This idea has proved fruitful in the past, with spectral-based graph descriptors providing state-of-the-art classification accuracy on benchmark datasets. However, these algorithms do not scale to large graphs since: 1) they require storing the entire graph in memory, and 2) the end-user has no control over the algorithm's runtime. In this paper, we present single-pass streaming algorithms to approximate structural features of graphs (counts of subgraphs of order $k \geq 4$). Operating on edge streams allows us to avoid keeping the entire graph in memory, and controlling the sample size enables us to control the time taken by the algorithm. We demonstrate the efficacy of our descriptors by analyzing the approximation error, classification accuracy, and scalability to massive graphs. Our experiments showcase the effect of the sample size on approximation error and predictive accuracy. The proposed descriptors are applicable to graphs with millions of edges within minutes and outperform the state-of-the-art descriptors in classification accuracy.
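The core streaming idea, sampling the edge stream in one pass and rescaling sampled subgraph counts, can be illustrated with a minimal triangle-count sketch. This is a hedged illustration only: the paper's descriptors target subgraphs of order $k \geq 4$ and use more refined reservoir-style estimators, and all function names below are made up for the example.

```python
import random
from collections import defaultdict

def streaming_triangle_estimate(edge_stream, p=0.1, seed=0):
    """Single-pass triangle-count estimate from an edge stream.

    Each edge is kept independently with probability p; a triangle survives
    in the sample with probability p**3, so rescaling the sampled count by
    1/p**3 yields an unbiased estimate of the true triangle count.
    """
    rng = random.Random(seed)
    adj = defaultdict(set)            # adjacency of the sampled subgraph
    sampled = 0
    for u, v in edge_stream:
        if rng.random() < p:
            # triangles this sampled edge closes within the current sample
            sampled += len(adj[u] & adj[v])
            adj[u].add(v)
            adj[v].add(u)
    return sampled / p**3

# toy usage: the complete graph K5 has C(5,3) = 10 triangles
edges = [(i, j) for i in range(5) for j in range(i + 1, 5)]
print(streaming_triangle_estimate(edges, p=1.0))  # exact: 10.0
```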
【2】 LG4AV: Combining Language Models and Graph Neural Networks for Author Verification Link: https://arxiv.org/abs/2109.01479
Authors: Maximilian Stubbemann, Gerd Stumme
Affiliations: L3S Research Center and University of Kassel, Kassel, Germany
Comments: 9 pages, 1 figure
Abstract: The automatic verification of document authorship is important in various settings. Researchers, for example, are judged and compared by the amount and impact of their publications, and public figures are confronted with their posts on social media platforms. It is therefore important that authorship information in frequently used web services and platforms is correct. The question of whether a given document is written by a given author is commonly referred to as authorship verification (AV). While AV is a widely investigated problem in general, only a few works consider settings where the documents are short and written in a rather uniform style. This makes most approaches impractical for online databases and knowledge graphs in the scholarly domain, where authorships of scientific publications have to be verified, often with just abstracts and titles available. To this end, we present our novel approach LG4AV, which combines language models and graph neural networks for authorship verification. By directly feeding the available texts into a pre-trained transformer architecture, our model does not need any hand-crafted stylometric features, which are not meaningful in scenarios where the writing style is, at least to some extent, standardized. By incorporating a graph neural network structure, our model can benefit from relations between authors that are meaningful with respect to the verification process. For example, scientific authors are more likely to write about topics that are addressed by their co-authors, and Twitter users tend to post about the same subjects as the people they follow. We experimentally evaluate our model and study to what extent the inclusion of co-authorships enhances verification decisions in bibliometric environments.
【3】 Edge-featured Graph Neural Architecture Search Link: https://arxiv.org/abs/2109.01356
Authors: Shaofei Cai, Liang Li, Xinzhe Han, Zheng-jun Zha, Qingming Huang
Affiliations: Key Lab of Intelligent Information Processing, Institute of Computing Technology, CAS, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; University of Science and Technology of China, China; Peng Cheng Laboratory, Shenzhen, China
Abstract: Graph neural networks (GNNs) have been successfully applied to learning representations on graphs in many relational tasks. Recently, researchers have studied neural architecture search (NAS) to reduce the dependence on human expertise and explore better GNN architectures, but these efforts over-emphasize entity features and ignore the latent relational information concealed in the edges. To solve this problem, we incorporate edge features into the graph search space and propose Edge-featured Graph Neural Architecture Search (EGNAS) to find the optimal GNN architecture. Specifically, we design rich entity- and edge-updating operations to learn high-order representations, which convey more generic message-passing mechanisms. Moreover, the architecture topology in our search space allows exploring complex feature dependencies of both entities and edges, which can be efficiently optimized by a differentiable search strategy. Experiments on three graph tasks over six datasets show that EGNAS can find better-performing GNNs than current state-of-the-art human-designed and search-based GNNs.
Transformer (1 paper)
【1】 Biomedical Data-to-Text Generation via Fine-Tuning Transformers Link: https://arxiv.org/abs/2109.01518
Authors: Ruslan Yermakov, Nicholas Drago, Angelo Ziletti
Affiliations: Decision Science & Advanced Analytics, Bayer AG; Regulatory Policy and Intelligence, Bayer AG
Comments: Accepted at INLG 2021 (International Conference on Natural Language Generation, organised by the Association for Computational Linguistics)
Abstract: Data-to-text (D2T) generation in the biomedical domain is a promising, yet mostly unexplored, field of research. Here, we apply neural models for D2T generation to a real-world dataset consisting of package leaflets of European medicines. We show that fine-tuned transformers are able to generate realistic, multi-sentence text from data in the biomedical domain, yet have important limitations. We also release a new dataset (BioLeaflets) for benchmarking D2T generation models in the biomedical domain.
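A minimal sketch of fine-tuning a seq2seq transformer for data-to-text generation, assuming the HuggingFace `transformers` library; the model name, record fields, and linearization scheme below are illustrative stand-ins, not the paper's exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

record = {"drug": "Aspirin", "form": "tablet", "dose": "500 mg"}
# linearize the structured record into a flat source string
source = " | ".join(f"{k}: {v}" for k, v in record.items())
target = "Aspirin is supplied as a 500 mg tablet."

batch = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
loss = model(**batch, labels=labels).loss   # teacher-forced cross-entropy
loss.backward()
optimizer.step()
```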
GAN | adversarial | attacks | generation (4 papers)
【1】 Contrastive Representation Learning for Exemplar-Guided Paraphrase Generation Link: https://arxiv.org/abs/2109.01484
Authors: Haoran Yang, Wai Lam, Piji Li
Affiliations: The Chinese University of Hong Kong; Tencent AI Lab
Comments: Findings of EMNLP 2021
Abstract: Exemplar-Guided Paraphrase Generation (EGPG) aims to generate a target sentence which conforms to the style of a given exemplar while encapsulating the content information of the source sentence. In this paper, we propose a new method with the goal of learning better representations of the style and the content. This method is mainly motivated by the recent success of contrastive learning, which has demonstrated its power in unsupervised feature extraction tasks. The idea is to design two contrastive losses with respect to the content and the style by considering two problem characteristics during training. One characteristic is that the target sentence shares the same content as the source sentence, and the second is that the target sentence shares the same style as the exemplar. These two contrastive losses are incorporated into the general encoder-decoder paradigm. Experiments on two datasets, namely QQP-Pos and ParaNMT, demonstrate the effectiveness of our proposed contrastive losses.
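A minimal InfoNCE-style contrastive loss, assuming PyTorch; this is a hedged sketch of the kind of loss the paper builds on (the paper defines two such losses, one over content pairs and one over style pairs, with its own pairing details).

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    """anchors, positives: (batch, dim); row i of `positives` is the positive
    pair of row i of `anchors`; all other rows act as in-batch negatives."""
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / temperature        # cosine-similarity matrix
    targets = torch.arange(a.size(0))       # match i-th anchor to i-th positive
    return F.cross_entropy(logits, targets)

# e.g., content loss: anchor = target-sentence encoding, positive = source
# encoding; style loss: anchor = target encoding, positive = exemplar encoding.
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```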
【2】 A Synergetic Attack against Neural Network Classifiers combining Backdoor and Adversarial Examples Link: https://arxiv.org/abs/2109.01275
Authors: Guanxiong Liu, Issa Khalil, Abdallah Khreishah, NhatHai Phan
Affiliations: New Jersey Institute of Technology, Newark, USA; Qatar Computing Research Institute, Doha, Qatar
Abstract: In this work, we show how to jointly exploit adversarial-perturbation and model-poisoning vulnerabilities to practically launch a new stealthy attack, dubbed AdvTrojan. AdvTrojan is stealthy because it can be activated only when: 1) a carefully crafted adversarial perturbation is injected into the input examples during inference, and 2) a Trojan backdoor is implanted during the training process of the model. We leverage adversarial noise in the input space to move Trojan-infected examples across the model decision boundary, making them difficult to detect. The stealthy behavior of AdvTrojan fools users into accidentally trusting the infected model as a robust classifier against adversarial examples. AdvTrojan can be implemented by only poisoning the training data, similar to conventional Trojan backdoor attacks. Our thorough analysis and extensive experiments on several benchmark datasets show that AdvTrojan can bypass existing defenses with a success rate close to 100% in most of our experimental scenarios, and can be extended to attack federated learning tasks as well.
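A hedged sketch of the two ingredients such an attack combines, assuming PyTorch: an FGSM adversarial perturbation crafted at inference time, and a fixed trigger patch of the kind planted via data poisoning during training. The patch location, size, epsilon, and toy model are illustrative, not the paper's exact construction.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=8 / 255):
    """One-step FGSM: move x in the direction that increases the loss."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def stamp_trigger(x, value=1.0, size=3):
    """Backdoor trigger: a small bright square in the bottom-right corner."""
    x = x.clone()
    x[..., -size:, -size:] = value
    return x

# toy usage on a placeholder classifier
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x, y = torch.rand(1, 1, 28, 28), torch.tensor([3])
# an AdvTrojan-style input would carry both ingredients at once:
x_attack = stamp_trigger(fgsm_perturb(model, x, y))
```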
【3】 Multimodal Conditionality for Natural Language Generation Link: https://arxiv.org/abs/2109.01229
Authors: Michael Sollami, Aashish Jain
Abstract: Large-scale pretrained language models have demonstrated state-of-the-art performance in language understanding tasks. Their application has recently expanded into multimodal learning, leading to improved representations combining vision and language. However, progress in adapting language models towards conditional Natural Language Generation (NLG) has been limited to a single modality, generally text. We propose MAnTiS, Multimodal Adaptation for Text Synthesis, a general approach for multimodal conditionality in transformer-based NLG models. In this method, we pass inputs from each modality through modality-specific encoders, project to the textual token space, and finally join them to form a conditionality prefix. We fine-tune the pretrained language model and encoders with the conditionality prefix guiding the generation. We apply MAnTiS to the task of product-description generation, conditioning a network on both product images and titles to generate descriptive text. We demonstrate that MAnTiS outperforms strong baseline approaches on standard NLG scoring metrics. Furthermore, qualitative assessments demonstrate that MAnTiS can generate human-quality descriptions consistent with the given multimodal inputs.
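A minimal sketch of the conditionality-prefix idea, assuming PyTorch; the encoder feature dimensions and sequence lengths below are illustrative. Each modality is encoded, projected into the token embedding space, and concatenated to form a prefix prepended to the text embeddings.

```python
import torch
import torch.nn as nn

d_token = 768
image_proj = nn.Linear(2048, d_token)   # projects image features to token space
title_proj = nn.Linear(512, d_token)    # projects title-encoder features

image_feats = torch.randn(1, 4, 2048)      # e.g., pooled CNN region features
title_feats = torch.randn(1, 6, 512)       # e.g., encoded title tokens
text_embeds = torch.randn(1, 20, d_token)  # embeddings of the target text

prefix = torch.cat([image_proj(image_feats), title_proj(title_feats)], dim=1)
decoder_inputs = torch.cat([prefix, text_embeds], dim=1)  # shape (1, 30, 768)
# `decoder_inputs` would be fed to the pretrained LM via its embedding-level
# input interface, with the LM loss computed only on the text positions.
```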
【4】 Relating the Partial Dependence Plot and Permutation Feature Importance to the Data Generating Process Link: https://arxiv.org/abs/2109.01433
Authors: Christoph Molnar, Timo Freiesleben, Gunnar König, Giuseppe Casalicchio, Marvin N. Wright, Bernd Bischl
Affiliations: Ludwig-Maximilian University Munich; University of Vienna
Abstract: Scientists and practitioners increasingly rely on machine learning to model data and draw conclusions. Compared to statistical modeling approaches, machine learning makes fewer explicit assumptions about data structures, such as linearity. However, its model parameters usually cannot be easily related to the data generating process. To learn about the modeled relationships, partial dependence (PD) plots and permutation feature importance (PFI) are often used as interpretation methods. However, PD and PFI lack a theory that relates them to the data generating process. We formalize PD and PFI as statistical estimators of ground-truth estimands rooted in the data generating process. We show that PD and PFI estimates deviate from this ground truth due to statistical biases, model variance, and Monte Carlo approximation errors. To account for model variance in PD and PFI estimation, we propose the learner-PD and the learner-PFI based on model refits, and propose corrected variance and confidence interval estimators.
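For reference, a minimal sketch of the two interpretation methods the paper studies, assuming scikit-learn and NumPy; the model and data are toy placeholders, and the paper's learner-PD/learner-PFI variants additionally average over model refits.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=500)
model = RandomForestRegressor(random_state=0).fit(X, y)

def partial_dependence(model, X, feature, grid):
    """Average prediction with `feature` clamped to each grid value."""
    pd = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v
        pd.append(model.predict(Xv).mean())
    return np.array(pd)

def permutation_importance(model, X, y, feature, rng):
    """Loss increase after shuffling one feature (breaks its link to y)."""
    base = np.mean((model.predict(X) - y) ** 2)
    Xp = X.copy()
    rng.shuffle(Xp[:, feature])
    return np.mean((model.predict(Xp) - y) ** 2) - base

print(partial_dependence(model, X, 0, np.linspace(-2, 2, 5)))
print(permutation_importance(model, X, y, 0, rng))
```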
Semi-/weakly-/un-/fully-supervised | uncertainty | active learning (4 papers)
【1】 Impact of GPU uncertainty on the training of predictive deep neural networks Link: https://arxiv.org/abs/2109.01451
Authors: Maciej Pietrowski, Andrzej Gajda, Takuto Yamamoto, Taisuke Kobayashi, Lana Sinapayen, Eiji Watanabe
Affiliations: Adam Mickiewicz University, Poznań, Poland; Osaka University, Suita, Osaka, Japan; Laboratory of Neurophysiology, National Institute for Basic Biology, Okazaki, Aichi, Japan
Abstract: Deep neural networks often present uncertainties such as hardware- and software-derived noise and randomness. We studied the effects of such uncertainty on learning outcomes, with a particular focus on the function of graphics processing units (GPUs), and found that GPU-induced uncertainty increased the learning accuracy of a certain deep neural network. When training a predictive deep neural network using only the CPU without the GPU, the learning error is higher than when training for the same number of epochs using the GPU, suggesting that the GPU plays a different role in the learning process than just increasing computational speed. Because this effect cannot be observed in learning by a simple autoencoder, it could be a phenomenon specific to certain types of neural networks. GPU-specific computational processing is more indeterminate than that by CPUs, and hardware-derived uncertainties, which are often considered obstacles that need to be eliminated, might, in some cases, be successfully incorporated into the training of deep neural networks. Moreover, such uncertainties might be interesting phenomena to consider in brain-related computational processing, which comprises a large mass of uncertain signals.
【2】 Segmentation of turbulent computational fluid dynamics simulations with unsupervised ensemble learning Link: https://arxiv.org/abs/2109.01381
Authors: Maarja Bussov, Joonas Nättilä
Affiliations: Tartu Observatory, University of Tartu, Tõravere, Estonia; Department of Physics, University of Helsinki, Helsinki, Finland
Comments: 15 pages, 8 figures. Accepted to Signal Processing: Image Communication. Code available from a repository: this https URL
Abstract: Computer vision and machine learning tools offer an exciting new way to automatically analyze and categorize information from complex computer simulations. Here we design an ensemble machine learning framework that can independently and robustly categorize and dissect simulation data output contents of turbulent flow patterns into distinct structure catalogues. The segmentation is performed using an unsupervised clustering algorithm, which segments physical structures by grouping together similar pixels in simulation images. The accuracy and robustness of the resulting segment region boundaries are enhanced by combining information from multiple simultaneously evaluated clustering operations. The stacking of object segmentation evaluations is performed using image mask combination operations. This statistically combined ensemble (SCE) of different cluster masks allows us to construct cluster reliability metrics for each pixel and for the associated segments without any prior user input. By comparing the similarity of different cluster occurrences in the ensemble, we can also assess the optimal number of clusters needed to describe the data. Furthermore, by relying on ensemble-averaged spatial segment region boundaries, the SCE method enables reconstruction of more accurate and robust region-of-interest (ROI) boundaries for the different image data clusters. We apply the SCE algorithm to 2-dimensional simulation data snapshots of magnetically dominated fully-kinetic turbulent plasma flows, where accurate ROI boundaries are needed for geometrical measurements of intermittent flow structures known as current sheets.
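A hedged sketch of statistically combining multiple clustering runs into per-pixel reliability scores, assuming scikit-learn and SciPy. Label alignment here matches clusters across runs by overlap and then uses majority voting, which is a simplification of the paper's mask-combination operations; the toy "image" is a random feature array.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
pixels = rng.normal(size=(1000, 2))          # toy per-pixel feature vectors
n_clusters, n_runs = 3, 5

runs = [KMeans(n_clusters=n_clusters, n_init=1, random_state=s)
        .fit_predict(pixels) for s in range(n_runs)]

ref = runs[0]
aligned = [ref]
for labels in runs[1:]:
    # match this run's cluster ids to the reference run's ids by overlap
    overlap = np.array([[np.sum((labels == i) & (ref == j))
                         for j in range(n_clusters)] for i in range(n_clusters)])
    row, col = linear_sum_assignment(-overlap)
    mapping = dict(zip(row, col))
    aligned.append(np.vectorize(mapping.get)(labels))

votes = np.stack(aligned)                    # (n_runs, n_pixels)
consensus = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
reliability = (votes == consensus).mean(axis=0)   # per-pixel agreement in [0, 1]
```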
【3】 Entity Linking and Discovery via Arborescence-based Supervised Clustering Link: https://arxiv.org/abs/2109.01242
Authors: Dhruv Agarwal, Rico Angell, Nicholas Monath, Andrew McCallum
Affiliations: College of Information and Computer Sciences, University of Massachusetts Amherst
Abstract: Previous work has shown promising results in performing entity linking by measuring not only the affinities between mentions and entities but also those amongst mentions. In this paper, we present novel training and inference procedures that fully utilize mention-to-mention affinities by building minimum arborescences (i.e., directed spanning trees) over mentions and entities across documents in order to make linking decisions. We also show that this method gracefully extends to entity discovery, enabling the clustering of mentions that do not have an associated entity in the knowledge base. We evaluate our approach on the Zero-Shot Entity Linking dataset and MedMentions, the largest publicly available biomedical dataset, and show significant improvements in performance for both entity linking and discovery compared to identically parameterized models. We further show significant efficiency improvements with only a small loss in accuracy over previous work, which uses more computationally expensive models.
【4】 Sample Noise Impact on Active Learning Link: https://arxiv.org/abs/2109.01372
Authors: Alexandre Abraham, Léo Dreyfus-Schmidt
Affiliations: Dataiku, Paris, France
Abstract: This work explores the effect of noisy sample selection in active learning strategies. We show, on both synthetic problems and real-life use cases, that knowledge of the sample noise can significantly improve the performance of active learning strategies. Building on prior work, we propose a robust sampler, Incremental Weighted K-Means, that brings significant improvement on the synthetic tasks but only a marginal uplift on real-life ones. We hope that the questions raised in this paper are of interest to the community and could open new paths for active learning research.
Transfer | Zero/Few/One-Shot | adaptation (1 paper)
【1】 A Bayesian Approach to (Online) Transfer Learning: Theory and Algorithms Link: https://arxiv.org/abs/2109.01377
Authors: Xuetong Wu, Jonathan H. Manton, Uwe Aickelin, Jingge Zhu
Affiliations: Department of Electronic and Electrical Engineering; Department of Computing and Information Systems, University of Melbourne, Australia
Comments: 45 pages, 12 figures
Abstract: Transfer learning is a machine learning paradigm where knowledge from one problem is utilized to solve a new but related problem. On the one hand, it is conceivable that knowledge from one task could be useful for solving a related task. On the other hand, it is also recognized that if not executed properly, transfer learning algorithms can in fact impair the learning performance instead of improving it, a failure commonly known as negative transfer. In this paper, we study transfer learning from a Bayesian perspective, where a parametric statistical model is used. Specifically, we study three variants of transfer learning problems: instantaneous, online, and time-variant transfer learning. For each problem, we define an appropriate objective function, and provide either exact expressions or upper bounds on the learning performance using information-theoretic quantities, which allow simple and explicit characterizations when the sample size becomes large. Furthermore, examples show that the derived bounds are accurate even for small sample sizes. The obtained bounds give valuable insights into the effect of prior knowledge for transfer learning in our formulation. In particular, we formally characterize the conditions under which negative transfer occurs. Lastly, we devise two (online) transfer learning algorithms that are amenable to practical implementations. Specifically, one algorithm does not require the parametric assumption, thus extending our results to more general models. We demonstrate the effectiveness of our algorithms with real data sets, especially when the source and target data have a strong similarity.
Reinforcement learning (4 papers)
【1】 Multi-agent Natural Actor-critic Reinforcement Learning Algorithms Link: https://arxiv.org/abs/2109.01654
Authors: Prashant Trivedi, Nandyala Hemachandra
Affiliations: Industrial Engineering and Operations Research, Indian Institute of Technology Bombay, India
Comments: 38 pages
Abstract: Both single-agent and multi-agent actor-critic algorithms are an important class of reinforcement learning algorithms. In this work, we propose three fully decentralized multi-agent natural actor-critic (MAN) algorithms. The agents' objective is to collectively learn a joint policy that maximizes the sum of the averaged long-term returns of these agents. In the absence of a central controller, agents communicate information to their neighbors via a time-varying communication network while preserving privacy. We prove the convergence of all three MAN algorithms to a globally asymptotically stable point of the ODE corresponding to the actor update; these use linear function approximations. We use the Fisher information matrix to obtain the natural gradients. The Fisher information matrix captures the curvature of the Kullback-Leibler (KL) divergence between policies at successive iterates. We also show that the gradient of this KL divergence between policies of successive iterates is proportional to the objective function's gradient. Our MAN algorithms indeed use this representation of the objective function's gradient. Under certain conditions on the Fisher information matrix, we prove that at each iterate, the optimal value via MAN algorithms can be better than that of the multi-agent actor-critic (MAAC) algorithm using the standard gradients. To validate the usefulness of our proposed algorithms, we implement all three MAN algorithms on a bi-lane traffic network to reduce the average network congestion. We observe an almost 25% reduction in the average congestion with two of the MAN algorithms; the average congestion with the third is on par with the MAAC algorithm. We also consider a generic 15-agent MARL; the performance of the MAN algorithms is again as good as the MAAC algorithm. We attribute the better performance of the MAN algorithms to their use of the above representation.
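For intuition, a minimal numerical sketch of a natural-gradient policy update, assuming NumPy; the toy policy is a softmax over actions and the "vanilla" gradient is a placeholder. The Fisher information matrix captures the KL curvature between successive policies, and the natural gradient preconditions the vanilla gradient by its (damped) inverse.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fisher_matrix(theta):
    """F = E_a[ grad log pi(a) grad log pi(a)^T ] for a softmax policy."""
    pi = softmax(theta)
    grads = np.eye(len(theta)) - pi          # row a: grad log pi(a) = e_a - pi
    return sum(p * np.outer(g, g) for p, g in zip(pi, grads))

theta = np.array([0.5, -0.5])
vanilla_grad = np.array([1.0, 0.2])          # placeholder policy gradient
F = fisher_matrix(theta)
# damping keeps the (rank-deficient) Fisher matrix invertible
natural_grad = np.linalg.solve(F + 1e-3 * np.eye(len(theta)), vanilla_grad)
theta = theta + 0.1 * natural_grad
```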
【2】 Efficient Communication in Multi-Agent Distributed Reinforcement Learning Link: https://arxiv.org/abs/2109.01417
Authors: Daniel Jarne Ornia, Manuel Mazo Jr
Affiliations: Delft University of Technology
Abstract: We present in this work an approach to reduce the communication of information needed in a multi-agent learning system, inspired by Event Triggered Control (ETC) techniques. We consider a baseline scenario of a distributed Q-learning problem on a Markov Decision Process (MDP). Following an event-based approach, N agents explore the MDP and communicate experiences to a central learner only when necessary, which performs updates of the actor Q functions. We analyse the convergence guarantees retained with respect to a regular Q-learning algorithm, and present experimental results showing that event-based communication results in a substantial reduction of data transmission rates in such distributed systems. Additionally, we discuss what effects (desired and undesired) these event-based approaches have on the learning processes studied, and how they can be applied to more complex multi-agent learning systems.
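A hedged sketch of event-triggered communication in distributed Q-learning: an agent transmits an experience to the central learner only when its local TD-error exceeds a threshold. The specific trigger rule and tabular setting below are illustrative, inspired by event-triggered control, not the paper's exact condition.

```python
import numpy as np

class EventTriggeredAgent:
    def __init__(self, n_states, n_actions, threshold=0.1, gamma=0.95):
        self.Q = np.zeros((n_states, n_actions))  # local copy of the Q-table
        self.threshold = threshold
        self.gamma = gamma

    def maybe_send(self, s, a, r, s_next):
        """Return the experience only if it is 'surprising' enough."""
        td_error = r + self.gamma * self.Q[s_next].max() - self.Q[s, a]
        if abs(td_error) > self.threshold:
            return (s, a, r, s_next)   # communicate to the central learner
        return None                    # stay silent, saving bandwidth

def central_update(Q, experience, alpha=0.1, gamma=0.95):
    """Standard Q-learning update applied by the central learner."""
    s, a, r, s_next = experience
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```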
【3】 Provably Safe Model-Based Meta Reinforcement Learning: An Abstraction-Based Approach Link: https://arxiv.org/abs/2109.01255
Authors: Xiaowu Sun, Wael Fatnassi, Ulices Santa Cruz, Yasser Shoukry
Affiliations: University of California
Abstract: While conventional reinforcement learning focuses on designing agents that can perform one task, meta-learning aims, instead, to solve the problem of designing agents that can generalize to different tasks (e.g., environments, obstacles, and goals) that were not considered during the design or the training of these agents. In this spirit, in this paper, we consider the problem of training a provably safe Neural Network (NN) controller for uncertain nonlinear dynamical systems that can generalize to new tasks that were not present in the training data while preserving strong safety guarantees. Our approach is to learn a set of NN controllers during the training phase. When the task becomes available at runtime, our framework will carefully select a subset of these NN controllers and compose them to form the final NN controller. Critical to our approach is the ability to compute a finite-state abstraction of the nonlinear dynamical system. This abstract model captures the behavior of the closed-loop system under all possible NN weights, and is used to train the NNs and compose them when the task becomes available. We provide theoretical guarantees that govern the correctness of the resulting NN. We evaluated our approach on the problem of controlling a wheeled robot in cluttered environments that were not present in the training data.
【4】 Multi-Agent Inverse Reinforcement Learning: Suboptimal Demonstrations and Alternative Solution Concepts Link: https://arxiv.org/abs/2109.01178
Authors: Sage Bergerson
Affiliations: Stanford Existential Risk Initiative, Stanford University, Stanford, CA
Abstract: Multi-agent inverse reinforcement learning (MIRL) can be used to learn reward functions from agents in social environments. To model realistic social dynamics, MIRL methods must account for suboptimal human reasoning and behavior. Traditional formalisms of game theory provide computationally tractable behavioral models, but assume agents have unrealistic cognitive capabilities. This research identifies and compares mechanisms in MIRL methods which a) handle noise, biases and heuristics in agent decision making and b) model realistic equilibrium solution concepts. MIRL research is systematically reviewed to identify solutions for these challenges. The methods and results of these studies are analyzed and compared based on factors including performance accuracy, efficiency, and descriptive quality. We found that the primary methods for handling noise, biases and heuristics in MIRL were extensions of Maximum Entropy (MaxEnt) IRL to multi-agent settings. We also found that many successful solution concepts are generalizations of the traditional Nash Equilibrium (NE). These solutions include the correlated equilibrium, logistic stochastic best response equilibrium and entropy-regularized mean field NE. Methods which use recursive reasoning or updating also perform well, including the feedback NE and archive multi-agent adversarial IRL. Success in modeling specific biases and heuristics in single-agent IRL, and promising results using a Theory of Mind approach in MIRL, imply that modeling specific biases and heuristics may be useful. Flexibility and unbiased inference in the identified alternative solution concepts suggest that a solution concept which has both recursive and generalized characteristics may perform well at modeling realistic social interactions.
Medical (1 paper)
【1】 Investigate the Correlation of Breast Cancer Dataset using Different Clustering Technique Link: https://arxiv.org/abs/2109.01538
Authors: Somenath Chakraborty, Beddhu Murali
Affiliations: School of Computing Sciences and Computer Engineering, The University of Southern Mississippi, Hattiesburg, USA
Abstract: The objectives of this paper are to explore ways to analyze the breast cancer dataset in the context of unsupervised learning, without a prior training model. The paper investigates different clustering techniques as well as preprocessing approaches. This in-depth analysis builds a foundation that can further be used to design robust and accurate medical prognosis systems. The paper also gives emphasis to the correlations of data points under different standard benchmark techniques. Keywords: breast cancer dataset, clustering techniques, Hopkins statistic, K-means clustering, k-medoids or partitioning around medoids (PAM)
Distillation | knowledge extraction (1 paper)
【1】 Towards extraction of orthogonal and parsimonious non-linear modes from turbulent flows Link: https://arxiv.org/abs/2109.01514
Authors: Hamidreza Eivazi, Soledad Le Clainche, Sergio Hoyas, Ricardo Vinuesa
Affiliations: FLOW, Engineering Mechanics, KTH Royal Institute of Technology, Stockholm, Sweden; School of Aerospace Engineering, Universidad Politécnica de Madrid, Madrid, Spain; Instituto Universitario de Matemática Pura y Aplicada, Universitat Politècnica de València
Abstract: We propose a deep probabilistic-neural-network architecture for learning a minimal and near-orthogonal set of non-linear modes from high-fidelity turbulent-flow-field data useful for flow analysis, reduced-order modeling, and flow control. Our approach is based on $\beta$-variational autoencoders ($\beta$-VAEs) and convolutional neural networks (CNNs), which allow us to extract non-linear modes from multi-scale turbulent flows while encouraging the learning of independent latent variables and penalizing the size of the latent vector. Moreover, we introduce an algorithm for ordering VAE-based modes with respect to their contribution to the reconstruction. We apply this method for non-linear mode decomposition of the turbulent flow through a simplified urban environment, where the flow-field data is obtained based on well-resolved large-eddy simulations (LESs). We demonstrate that by constraining the shape of the latent space, it is possible to motivate the orthogonality and extract a set of parsimonious modes sufficient for high-quality reconstruction. Our results show the excellent performance of the method in the reconstruction against linear-theory-based decompositions. Moreover, we compare our method with available AE-based models. We show the ability of our approach in the extraction of near-orthogonal modes that may lead to interpretability.
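For reference, a minimal $\beta$-VAE training objective, assuming PyTorch; the architecture sizes are illustrative (the paper uses convolutional encoders/decoders on flow snapshots). The $\beta$ factor scales the KL term, which encourages independent, near-orthogonal latent variables and penalizes latent-code capacity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaVAE(nn.Module):
    def __init__(self, d_in=256, d_latent=8):
        super().__init__()
        self.enc = nn.Linear(d_in, 2 * d_latent)  # outputs mean and log-variance
        self.dec = nn.Linear(d_latent, d_in)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.dec(z), mu, logvar

def beta_vae_loss(x, x_hat, mu, logvar, beta=4.0):
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl     # beta > 1 pushes toward disentangled latents

model = BetaVAE()
x = torch.randn(16, 256)         # stand-in for flattened flow snapshots
x_hat, mu, logvar = model(x)
beta_vae_loss(x, x_hat, mu, logvar).backward()
```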
Clustering (1 paper)
【1】 J-Score: A Robust Measure of Clustering Accuracy Link: https://arxiv.org/abs/2109.01306
Authors: Navid Ahmadinejad, Li Liu
Affiliations: College of Health Solutions, Arizona State University, Phoenix, AZ, USA; Biodesign Institute, Arizona State University, Phoenix, AZ, USA; Department of Neurology, Scottsdale, AZ, USA
Abstract: Background. Clustering analysis discovers hidden structures in a data set by partitioning it into disjoint clusters. Robust accuracy measures that evaluate the goodness of clustering results are critical for algorithm development and model diagnosis. Common problems of current clustering accuracy measures include overlooking unmatched clusters, biases towards excessive clusters, unstable baselines, and difficult interpretation. In this study, we present a novel accuracy measure, J-score, that addresses these issues. Methods. Given a data set with known class labels, J-score quantifies how well the hypothetical clusters produced by clustering analysis recover the true classes. It starts with bidirectional set matching to identify the correspondence between true classes and hypothetical clusters based on the Jaccard index. It then computes two weighted sums of Jaccard indices measuring the reconciliation from classes to clusters and vice versa. The final J-score is the harmonic mean of the two weighted sums. Results. Via simulation studies, we evaluated the performance of J-score and compared it with existing measures. Our results show that J-score is effective in distinguishing partition structures that differ only by unmatched clusters, rewarding correct inference of class numbers, addressing biases towards excessive clusters, and having a relatively stable baseline. The simplicity of its calculation makes the interpretation straightforward. It is a valuable tool complementary to other accuracy measures. We released an R/jScore package implementing the algorithm.
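A hedged sketch of the J-score computation as described in the abstract, in NumPy: bidirectional Jaccard matching between true classes and hypothetical clusters, two weighted sums, and their harmonic mean. The size-based weighting below is my reading of the abstract; the released R/jScore package is the authoritative definition.

```python
import numpy as np

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def j_score(true_labels, cluster_labels):
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    classes = [np.flatnonzero(true_labels == c) for c in np.unique(true_labels)]
    clusters = [np.flatnonzero(cluster_labels == k) for k in np.unique(cluster_labels)]
    n = len(true_labels)
    # classes -> clusters: each class matched to its best-overlapping cluster
    s_ab = sum(len(c) / n * max(jaccard(c, k) for k in clusters) for c in classes)
    # clusters -> classes: each cluster matched to its best-overlapping class
    s_ba = sum(len(k) / n * max(jaccard(k, c) for c in classes) for k in clusters)
    return 2 * s_ab * s_ba / (s_ab + s_ba)   # harmonic mean of the two sums

print(j_score([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # 1.0: perfect up to relabeling
```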
Autonomous driving | vehicles | lane detection, etc. (1 paper)
【1】 Is Machine Learning Ready for Traffic Engineering Optimization? Link: https://arxiv.org/abs/2109.01445
Authors: Guillermo Bernárdez, José Suárez-Varela, Albert López, Bo Wu, Shihan Xiao, Xiangle Cheng, Pere Barlet-Ros, Albert Cabellos-Aparicio
Affiliations: Barcelona Neural Networking Center, Universitat Politècnica de Catalunya, Barcelona, Spain; Network Technology Lab., Huawei Technologies Co., Ltd., Beijing, China
Comments: To appear at IEEE ICNP 2021
Abstract: Traffic Engineering (TE) is a basic building block of the Internet. In this paper, we analyze whether modern Machine Learning (ML) methods are ready to be used for TE optimization. We address this open question through a comparative analysis between the state of the art in ML and the state of the art in TE. To this end, we first present a novel distributed system for TE that leverages the latest advancements in ML. Our system implements a novel architecture that combines Multi-Agent Reinforcement Learning (MARL) and Graph Neural Networks (GNN) to minimize network congestion. In our evaluation, we compare our MARL GNN system with DEFO, a network optimizer based on Constraint Programming that represents the state of the art in TE. Our experimental results show that the proposed MARL GNN solution achieves equivalent performance to DEFO in a wide variety of network scenarios, including three real-world network topologies. At the same time, we show that MARL GNN can achieve significant reductions in execution time (from the scale of minutes with DEFO to a few seconds with our solution).
Federated learning | privacy | encryption (2 papers)
【1】 Ground-Assisted Federated Learning in LEO Satellite Constellations Link: https://arxiv.org/abs/2109.01348
Authors: Nasrin Razmi, Bho Matthiesen, Armin Dekorsy, Petar Popovski
Comments: Submitted to IEEE Wireless Communications Letters
Abstract: In Low Earth Orbit (LEO) mega constellations, there are relevant use cases, such as inference based on satellite imaging, in which a large number of satellites collaboratively train a machine learning model without sharing their local data sets. To address this problem, we propose a new set of algorithms based on federated learning (FL). Our approach differs substantially from the standard FL algorithms, as it takes into account the predictable connectivity patterns that are inherent to LEO constellations. Extensive numerical evaluations highlight the fast convergence speed and excellent asymptotic test accuracy of the proposed method. In particular, the achieved test accuracy is within 96% to 99.6% of the centralized solution, and the proposed algorithm has fewer hyperparameters to tune than state-of-the-art asynchronous FL methods.
【2】 Statistical Estimation and Inference via Local SGD in Federated Learning Link: https://arxiv.org/abs/2109.01326
Authors: Xiang Li, Jiadong Liang, Xiangyu Chang, Zhihua Zhang
Abstract: Federated Learning (FL) makes a large number of edge computing devices (e.g., mobile phones) jointly learn a global model without data sharing. In FL, data are generated in a decentralized manner with high heterogeneity. This paper studies how to perform statistical estimation and inference in the federated setting. We analyze so-called Local SGD, a multi-round estimation procedure that uses intermittent communication to improve communication efficiency. We first establish a functional central limit theorem that shows the averaged iterates of Local SGD weakly converge to a rescaled Brownian motion. We next provide two iterative inference methods: the plug-in and the random scaling. Random scaling constructs an asymptotically pivotal statistic for inference by using the information along the whole Local SGD path. Both methods are communication-efficient and applicable to online data. Our theoretical and empirical results show that Local SGD simultaneously achieves both statistical efficiency and communication efficiency.
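A minimal Local SGD sketch, assuming NumPy: each of M workers runs K local gradient steps on its own data, then the server averages the iterates. The averaged iterate path is the object the paper's functional CLT and inference procedures are built on; the toy least-squares problem below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, K, rounds, lr = 5, 4, 10, 50, 0.05
theta_star = rng.normal(size=d)

data = []
for _ in range(M):                           # heterogeneous local datasets
    X = rng.normal(size=(100, d))
    data.append((X, X @ theta_star + 0.1 * rng.normal(size=100)))

theta = np.zeros(d)
for _ in range(rounds):                      # communication rounds
    local_iterates = []
    for X, y in data:
        w = theta.copy()
        for _ in range(K):                   # K local SGD steps per round
            i = rng.integers(len(y))
            w -= lr * (X[i] @ w - y[i]) * X[i]   # stochastic gradient step
        local_iterates.append(w)
    theta = np.mean(local_iterates, axis=0)  # intermittent averaging

print(np.linalg.norm(theta - theta_star))    # close to zero after training
```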
Inference | analysis | understanding | interpretation (2 papers)
【1】 CX-ToM: Counterfactual Explanations with Theory-of-Mind for Enhancing Human Trust in Image Recognition Models Link: https://arxiv.org/abs/2109.01401
Authors: Arjun R. Akula, Keze Wang, Changsong Liu, Sari Saba-Sadiya, Hongjing Lu, Sinisa Todorovic, Joyce Chai, Song-Chun Zhu
Affiliations: Oregon State University; University of Michigan
Comments: Accepted by iScience (Cell Press journal), 2021. arXiv admin note: text overlap with arXiv:1909.06907
Abstract: We propose CX-ToM, short for counterfactual explanations with theory-of-mind, a new explainable AI (XAI) framework for explaining decisions made by a deep convolutional neural network (CNN). In contrast to current methods in XAI that generate explanations as a single-shot response, we pose explanation as an iterative communication process, i.e. a dialog, between the machine and the human user. More concretely, our CX-ToM framework generates a sequence of explanations in a dialog by mediating the differences between the minds of the machine and the human user. To do this, we use Theory of Mind (ToM), which helps us explicitly model the human's intention, the machine's mind as inferred by the human, and the human's mind as inferred by the machine. Moreover, most state-of-the-art XAI frameworks provide attention (or heat map) based explanations. In our work, we show that these attention-based explanations are not sufficient for increasing human trust in the underlying CNN model. In CX-ToM, we instead use counterfactual explanations called fault-lines, which we define as follows: given an input image I for which a CNN classification model M predicts class c_pred, a fault-line identifies the minimal semantic-level features (e.g., stripes on a zebra, pointed ears of a dog), referred to as explainable concepts, that need to be added to or deleted from I in order to alter the classification category of I by M to another specified class c_alt. We argue that, due to the iterative, conceptual and counterfactual nature of CX-ToM explanations, our framework is practical and more natural for both expert and non-expert users to understand the internal workings of complex deep learning models. Extensive quantitative and qualitative experiments verify our hypotheses, demonstrating that our CX-ToM significantly outperforms the state-of-the-art explainable AI models.
【2】 On the Accuracy of Analog Neural Network Inference Accelerators Link: https://arxiv.org/abs/2109.01262
Authors: T. Patrick Xiao, Ben Feinberg, Christopher H. Bennett, Venkatraman Prabhakar, Prashant Saxena, Vineet Agrawal, Sapan Agarwal, Matthew J. Marinella
Affiliations: Sandia National Laboratories; Infineon Technologies
Abstract: Specialized accelerators have recently garnered attention as a method to reduce the power consumption of neural network inference. A promising category of accelerators utilizes nonvolatile memory arrays to both store weights and perform in situ analog computation inside the array. While prior work has explored the design space of analog accelerators to optimize performance and energy efficiency, there is seldom a rigorous evaluation of the accuracy of these accelerators. This work shows how architectural design decisions, particularly in mapping neural network parameters to analog memory cells, influence inference accuracy. When evaluated using ResNet50 on ImageNet, the resilience of the system to analog non-idealities (cell programming errors, analog-to-digital converter resolution, and array parasitic resistances) all improve when analog quantities in the hardware are made proportional to the weights in the network. Moreover, contrary to the assumptions of prior work, nearly equivalent resilience to cell imprecision can be achieved by fully storing weights as analog quantities, rather than spreading weight bits across multiple devices, often referred to as bit slicing. By exploiting proportionality, analog system designers have the freedom to match the precision of the hardware to the needs of the algorithm, rather than attempting to guarantee the same level of precision in the intermediate results as an equivalent digital accelerator. This ultimately results in an analog accelerator that is more accurate, more robust to analog errors, and more energy-efficient.
Classification | recognition (1 paper)
【1】 Empirical Study of Named Entity Recognition Performance Using Distribution-aware Word Embedding Link: https://arxiv.org/abs/2109.01636
Authors: Xin Chen, Qi Zhao, Xinyang Liu
Affiliations: ISE Department, University of Illinois at Urbana-Champaign
Abstract: With the fast development of deep learning techniques, Named Entity Recognition (NER) is becoming more and more important in information extraction tasks. The greatest difficulty the NER task faces is maintaining detectability even when the types of named entities and documents are unfamiliar. Realizing that specificity information may contain potential meanings of a word and generate semantically related features for word embedding, we develop a distribution-aware word embedding and implement three different methods to make use of the distribution information in a NER framework. The results show that the performance of NER is improved when word specificity is incorporated into existing NER methods.
Optimization | convergence (1 paper)
【1】 Pareto-Optimal Learning-Augmented Algorithms for Online Conversion Problems Link: https://arxiv.org/abs/2109.01556
Authors: Bo Sun, Russell Lee, Mohammad Hajiesmaili, Adam Wierman, Danny H.K. Tsang
Abstract: This paper leverages machine-learned predictions to design competitive algorithms for online conversion problems, with the goal of improving the competitive ratio when predictions are accurate (i.e., consistency) while also guaranteeing a worst-case competitive ratio regardless of the prediction quality (i.e., robustness). We unify the algorithmic design of both integral and fractional conversion problems, which are also known as the 1-max-search and one-way trading problems, into a class of online threshold-based algorithms (OTA). By incorporating predictions into the design of OTA, we achieve the Pareto-optimal trade-off of consistency and robustness, i.e., no online algorithm can achieve a better consistency guarantee for a given robustness guarantee. We demonstrate the performance of OTA using numerical experiments on Bitcoin conversion.
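A hedged sketch of a threshold-based algorithm (OTA) for 1-max-search: with prices known to lie in [m, M], accepting the first price above the classical threshold sqrt(m*M) yields the optimal sqrt(M/m) competitive ratio without predictions. A prediction-augmented variant shifts the threshold toward the predicted peak; the linear interpolation below is illustrative, not the paper's exact Pareto-optimal rule.

```python
import math

def one_max_search(prices, m, M, prediction=None, trust=0.0):
    """Sell at the first price that clears the threshold; forced sale at the end."""
    threshold = math.sqrt(m * M)            # classical worst-case threshold
    if prediction is not None:
        # trust in [0, 1]: 0 = pure worst-case, 1 = fully trust the prediction
        threshold = (1 - trust) * threshold + trust * prediction
    for p in prices[:-1]:
        if p >= threshold:
            return p
    return prices[-1]                       # must trade at the last price

# with m=1, M=16 the baseline threshold is 4 (would sell at 5);
# a trusted prediction of 9 raises it to 6.5 and captures the peak
print(one_max_search([2, 5, 9, 4], m=1, M=16, prediction=9, trust=0.5))  # 9
```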
Prediction | estimation (3 papers)
【1】 MACEst: The reliable and trustworthy Model Agnostic Confidence Estimator Link: https://arxiv.org/abs/2109.01531
Authors: Rhys Green, Matthew Rowe, Alberto Polleri
Affiliations: Gravity Exploration Institute, Cardiff University, Cardiff, UK; Oracle AI Apps, London, UK
Abstract: Reliable confidence estimates are hugely important for any machine learning model to be truly useful. In this paper, we argue that any confidence estimates based upon standard machine learning point-prediction algorithms are fundamentally flawed and, in situations with a large amount of epistemic uncertainty, are likely to be untrustworthy. To address these issues, we present MACEst, a Model Agnostic Confidence Estimator, which provides reliable and trustworthy confidence estimates. The algorithm differs from current methods by estimating confidence independently, as a local quantity which explicitly accounts for both aleatoric and epistemic uncertainty. This approach differs from standard calibration methods that use a global point-prediction model as a starting point for the confidence estimate.
【2】 Building Interpretable Models for Business Process Prediction using Shared and Specialised Attention Mechanisms Link: https://arxiv.org/abs/2109.01419
Authors: Bemali Wickramanayake, Zhipeng He, Chun Ouyang, Catarina Moreira, Yue Xu, Renuka Sindhgatta
Affiliations: Queensland University of Technology, Brisbane, Australia; IBM Research, Bangalore, India
Comments: 25 pages, 11 figures, 5 tables
Abstract: In this paper, we address the "black-box" problem in predictive process analytics by building interpretable models that are capable of informing both what a prediction is and why it was made. Predictive process analytics is a newly emerged discipline dedicated to providing business process intelligence in modern organisations. It uses event logs, which capture process execution traces in the form of multi-dimensional sequence data, as the key input to train predictive models. These predictive models, often built upon deep learning techniques, can be used to make predictions about the future states of business process execution. We apply attention mechanisms to achieve model interpretability. We propose i) two types of attention: event attention, to capture the impact of specific process events on a prediction, and attribute attention, to reveal which attribute(s) of an event influenced the prediction; and ii) two attention mechanisms: a shared attention mechanism and a specialised attention mechanism, reflecting different design decisions on whether to construct attribute attention on individual input features (specialised) or on the concatenated feature tensor of all input feature vectors (shared). These lead to two distinct attention-based models, both of which are interpretable models that incorporate interpretability directly into the structure of a process predictive model. We conduct an experimental evaluation of the proposed models using a real-life dataset, along with a comparative analysis between the models for accuracy and interpretability, and draw insights from the evaluation and analysis results.
【3】 Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters Link: https://arxiv.org/abs/2109.01313
Authors: Qinghao Hu, Peng Sun, Shengen Yan, Yonggang Wen, Tianwei Zhang
Affiliations: SenseTime
Comments: This paper has been accepted by the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC21), Nov 14-19, 2021, St. Louis, USA
Abstract: Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of job features and user behaviors. We present a comprehensive study of the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs, and users, which can facilitate cluster system designs. Second, we introduce a general-purpose framework which manages resources based on historical data. As case studies, we design: a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job completion time by up to 6.5x; and a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.
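A hedged sketch of Quasi-Shortest-Service-First scheduling: jobs are dispatched in order of predicted service time, approximating shortest-job-first when true durations are unknown. The prediction source (historical per-user statistics) and job fields below are illustrative, not the paper's exact predictor.

```python
import heapq

class QSSFScheduler:
    def __init__(self, predict_service_time):
        self.predict = predict_service_time   # callable: job -> predicted GPU-time
        self.queue = []                       # min-heap keyed by the prediction
        self.counter = 0                      # tie-breaker for equal predictions

    def submit(self, job):
        heapq.heappush(self.queue, (self.predict(job), self.counter, job))
        self.counter += 1

    def next_job(self):
        return heapq.heappop(self.queue)[2] if self.queue else None

# toy predictor: historical mean duration of the submitting user's jobs
history = {"alice": 120.0, "bob": 3600.0}
sched = QSSFScheduler(lambda j: history.get(j["user"], 600.0))
sched.submit({"id": 1, "user": "bob"})
sched.submit({"id": 2, "user": "alice"})
print(sched.next_job()["id"])   # 2: alice's predicted-short job runs first
```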
其他神经网络|深度学习|模型|建模(7篇)
【1】 Stochastic Physics-Informed Neural Networks (SPINN): A Moment-Matching Framework for Learning Hidden Physics within Stochastic Differential Equations 标题:随机物理信息神经网络(SPINN):学习随机微分方程隐含物理的矩匹配框架 链接:https://arxiv.org/abs/2109.01621
作者:Jared O'Leary,Joel A. Paulson,Ali Mesbah 机构:Department of Chemical and Biomolecular Engineering, University of California, Berkeley, Berkeley, CA, USA, Department of Chemical and Biomolecular Engineering, The Ohio State, University, Columbus, OH, USA 摘要:随机微分方程(SDE)用于描述各种复杂的随机动力系统。学习SDE中隐藏的物理,对于从根本上理解这些系统的随机与非线性行为至关重要。我们提出了一个灵活且可扩展的框架,用于训练深度神经网络来学习表示SDE中隐藏物理的本构方程。所提出的随机物理信息神经网络框架(SPINN)依赖于不确定性传播和矩匹配技术,以及最先进的深度学习策略。SPINN首先通过SDE的已知结构(即已知物理)传播随机性,以预测随机状态统计矩的时间演化;再通过将预测的矩与从数据中估计的矩相匹配,学习隐藏物理的(深度)神经网络表示。我们利用自动微分和小批量梯度下降的最新进展来确定神经网络的未知参数。我们在三个基准计算机模拟(in-silico)案例研究中演示了SPINN,并分析了该框架的鲁棒性和数值稳定性。SPINN为系统地揭示具有乘性噪声的多元随机动力系统的隐藏物理提供了一个有前景的新方向。 摘要:Stochastic differential equations (SDEs) are used to describe a wide variety of complex stochastic dynamical systems. Learning the hidden physics within SDEs is crucial for unraveling fundamental understanding of the stochastic and nonlinear behavior of these systems. We propose a flexible and scalable framework for training deep neural networks to learn constitutive equations that represent hidden physics within SDEs. The proposed stochastic physics-informed neural network framework (SPINN) relies on uncertainty propagation and moment-matching techniques along with state-of-the-art deep learning strategies. SPINN first propagates stochasticity through the known structure of the SDE (i.e., the known physics) to predict the time evolution of statistical moments of the stochastic states. SPINN learns (deep) neural network representations of the hidden physics by matching the predicted moments to those estimated from data. Recent advances in automatic differentiation and mini-batch gradient descent are leveraged to establish the unknown parameters of the neural networks. We demonstrate SPINN on three benchmark in-silico case studies and analyze the framework's robustness and numerical stability. SPINN provides a promising new direction for systematically unraveling the hidden physics of multivariate stochastic dynamical systems with multiplicative noise.
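下面给出一个带有较强简化假设的PyTorch示意:以一维SDE dX=(aX+f_θ(X))dt+σdW 为例,用一阶矩闭包传播均值与方差,再以矩匹配损失训练隐藏物理 f_θ(闭包方式与论文的具体技术可能不同):

```python
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))  # 隐藏物理 f_θ
a, sigma, dt = -1.0, 0.3, 0.01  # 已知物理的参数(虚构数值)

def propagate_moments(mu, var, steps):
    """欧拉式矩传播;简化起见用 E[f(X)]≈f(E[X]) 的一阶闭包,
    且方差方程只保留已知线性项(论文的矩闭包更系统)。"""
    for _ in range(steps):
        mu = mu + (a * mu + f(mu.unsqueeze(-1)).squeeze(-1)) * dt
        var = var + (2 * a * var + sigma ** 2) * dt
    return mu, var

# 矩匹配损失:将预测矩与由数据估计的经验矩对齐
mu0, var0 = torch.zeros(1), torch.full((1,), 0.1)
mu_T, var_T = propagate_moments(mu0, var0, steps=100)
emp_mu, emp_var = torch.tensor([0.2]), torch.tensor([0.15])  # 假设的数据矩
loss = ((mu_T - emp_mu) ** 2 + (var_T - emp_var) ** 2).sum()
loss.backward()  # 随后即可用小批量梯度下降更新 f_θ 的参数
```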
【2】 Using Topological Framework for the Design of Activation Function and Model Pruning in Deep Neural Networks 标题:利用拓扑框架设计深层神经网络的激活函数和模型剪枝 链接:https://arxiv.org/abs/2109.01572
作者:Yogesh Kochar,Sunil Kumar Vengalil,Neelam Sinha 机构:Samsung India Research Bangalore, name of organization (of Aff.), Bangalore, India, International Institute of Information Technology 摘要:深度神经网络在计算机视觉、语音识别和自然语言处理等领域各类任务上的成功,使得理解训练过程的动态以及已训练模型的工作机制变得十分必要。本文有两个相互独立的贡献:1)用于加快训练收敛的新激活函数;2)对已训练模型的滤波器进行系统剪枝(与所用激活函数无关)。通过改变激活函数,我们分析了训练样本空间在训练过程中被每个连续层变换时发生的拓扑变换,并针对二元分类任务报告了激活函数变化对训练收敛性的影响。我们提出了一种旨在加快分类任务收敛速度的新激活函数。这里,Betti数用于量化数据的拓扑复杂度。我们报告了使用MLP在具有大Betti数(>150)的常用合成二元分类数据集上的实验结果。结果表明,所提激活函数使层间Betti数下降得更快,因而收敛更快,所需训练轮数减少1.5到2倍。所提方法在基准图像数据集fashion MNIST、CIFAR-10和cat-vs-dog上用CNN得到了验证。基于实证结果,我们提出了一种剪枝已训练模型的新方法:剔除将数据变换到Betti数很大的拓扑空间的滤波器。每一层中所有Betti数大于300的滤波器均被移除,而准确率没有明显下降。这带来了更快的预测时间并减小了模型的内存占用。 摘要:Success of deep neural networks in diverse tasks across domains of computer vision, speech recognition and natural language processing, has necessitated understanding the dynamics of training process and also working of trained models. Two independent contributions of this paper are 1) Novel activation function for faster training convergence 2) Systematic pruning of filters of models trained irrespective of activation function. We analyze the topological transformation of the space of training samples as it gets transformed by each successive layer during training, by changing the activation function. The impact of changing activation function on the convergence during training is reported for the task of binary classification. A novel activation function aimed at faster convergence for classification tasks is proposed. Here, Betti numbers are used to quantify topological complexity of data. Results of experiments on popular synthetic binary classification datasets with large Betti numbers(>150) using MLPs are reported. Results show that the proposed activation function results in faster convergence requiring fewer epochs by a factor of 1.5 to 2, since Betti numbers reduce faster across layers with the proposed activation function. The proposed methodology was verified on benchmark image datasets: fashion MNIST, CIFAR-10 and cat-vs-dog images, using CNNs. Based on empirical results, we propose a novel method for pruning a trained model. The trained model was pruned by eliminating filters that transform data to a topological space with large Betti numbers. All filters with Betti numbers greater than 300 were removed from each layer without significant reduction in accuracy. This resulted in faster prediction time and reduced memory size of the model.
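下面是一个计算点云Betti数的最小示意,其中持续同调计算假设使用第三方库 ripser(论文未指明所用工具);这类Betti数分数即可用作文中筛选、剪枝滤波器的依据:

```python
import numpy as np
from ripser import ripser  # 假设使用 ripser 计算持续同调

def betti_numbers(points, scale, maxdim=1):
    """统计在给定尺度下仍“存活”的持续同调条形数,近似各维Betti数。"""
    dgms = ripser(points, maxdim=maxdim)["dgms"]
    return [int(np.sum((d[:, 0] <= scale) & (d[:, 1] > scale))) for d in dgms]

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
circle = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))
print(betti_numbers(circle, scale=0.5))  # 期望接近 [1, 1]:一个连通分量、一个环
# 剪枝思路:对每个滤波器输出的激活点云计算Betti数,剔除使其超过阈值(文中为300)的滤波器。
```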
【3】 Large-Scale Learning with Fourier Features and Tensor Decompositions 标题:具有傅立叶特征和张量分解的大规模学习 链接:https://arxiv.org/abs/2109.01545
作者:Frederiek Wesel,Kim Batselier 机构:Delft Center for Systems and Control, Delft University of Technology 备注:9 pages, 6 figures 摘要:随机傅里叶特征提供了一种用核方法解决大规模机器学习问题的途径。其缓慢的蒙特卡罗收敛速度激发了对确定性傅里叶特征的研究,后者的近似误差随频率数呈指数下降。然而,由于张量积结构,这些方法深受维数灾难之害,适用范围局限于二维或三维场景。在我们的方法中,我们恰恰通过利用确定性傅里叶特征的张量积结构来克服上述维数灾难,从而能将模型参数表示为低秩张量分解。我们推导了一个单调收敛的块坐标下降算法:对于正则化平方损失函数,其复杂度在样本量和输入维数上均为线性,从而能够使用确定性傅里叶特征学习分解形式的简约模型。我们通过数值实验证明,我们的低秩张量方法可获得与相应非参数模型相同的性能,并始终优于随机傅里叶特征。 摘要:Random Fourier features provide a way to tackle large-scale machine learning problems with kernel methods. Their slow Monte Carlo convergence rate has motivated the research of deterministic Fourier features whose approximation error decreases exponentially with the number of frequencies. However, due to their tensor product structure these methods suffer heavily from the curse of dimensionality, limiting their applicability to two or three-dimensional scenarios. In our approach we overcome said curse of dimensionality by exploiting the tensor product structure of deterministic Fourier features, which enables us to represent the model parameters as a low-rank tensor decomposition. We derive a monotonically converging block coordinate descent algorithm with linear complexity in both the sample size and the dimensionality of the inputs for a regularized squared loss function, allowing to learn a parsimonious model in decomposed form using deterministic Fourier features. We demonstrate by means of numerical experiments how our low-rank tensor approach obtains the same performance of the corresponding nonparametric model, consistently outperforming random Fourier features.
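下面的NumPy示意展示确定性傅里叶特征的张量积结构如何与CP(低秩)分解结合,使预测复杂度对样本量和维数均为线性;频率网格、秩与初始化均为演示用假设,块坐标下降的细节从略:

```python
import numpy as np

M, D, R = 8, 5, 4  # 每维频率数、输入维数、CP 秩(均为演示取值)

def fourier_features(x_d, M):
    """单维确定性傅里叶特征(整数频率的 cos/sin),返回形状 (n, 2M)。"""
    k = np.arange(1, M + 1)
    return np.hstack([np.cos(np.outer(x_d, k)), np.sin(np.outer(x_d, k))])

def predict(X, W):
    """f(x) = Σ_r Π_d ⟨w_{d,r}, z_d(x_d)⟩:逐维累乘,
    避免显式构造 (2M)^D 维的张量积特征。"""
    out = np.ones((X.shape[0], R))
    for d in range(D):
        out *= fourier_features(X[:, d], M) @ W[d]  # W[d]: (2M, R) 的因子矩阵
    return out.sum(axis=1)

X = np.random.rand(100, D)
W = [np.random.randn(2 * M, R) / np.sqrt(2 * M) for _ in range(D)]
print(predict(X, W).shape)  # (100,) —— 计算量随样本量与维数线性增长
```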
【4】 Dive into Layers: Neural Network Capacity Bounding using Algebraic Geometry 标题:深入层次:使用代数几何学的神经网络容量界限 链接:https://arxiv.org/abs/2109.01461
作者:Ji Yang,Lu Sang,Daniel Cremers 机构:Enflame-tech, ShengXia Road, Pudong New District, Shanghai, Technical University of Munich, Boltzmannstrasse , Garching, Germany 摘要:实证结果表明,神经网络的可学习性与其大小直接相关。为了从数学上证明这一点,我们借用拓扑代数中的一个工具:Betti数来测量输入数据和神经网络的拓扑几何复杂性。通过描述一个具有拓扑复杂性的神经网络的表达能力,我们进行了深入的分析,表明该网络的表达能力受到其层次规模的限制。进一步,我们推导了网络中每一层上Betti数的上界。因此,将神经网络的结构选择问题转化为确定能够表示输入数据复杂性的网络规模。根据给出的结果,全连接网络的架构选择归结为选择合适的网络大小,使其配备不小于输入数据Betti数的Betti数。我们在真实数据集MNIST上进行了实验,结果验证了我们的分析和结论。该代码将公开提供。 摘要:The empirical results suggest that the learnability of a neural network is directly related to its size. To mathematically prove this, we borrow a tool in topological algebra: Betti numbers to measure the topological geometric complexity of input data and the neural network. By characterizing the expressive capacity of a neural network with its topological complexity, we conduct a thorough analysis and show that the network's expressive capacity is limited by the scale of its layers. Further, we derive the upper bounds of the Betti numbers on each layer within the network. As a result, the problem of architecture selection of a neural network is transformed to determining the scale of the network that can represent the input data complexity. With the presented results, the architecture selection of a fully connected network boils down to choosing a suitable size of the network such that it equips the Betti numbers that are not smaller than the Betti numbers of the input data. We perform the experiments on a real-world dataset MNIST and the results verify our analysis and conclusion. The code will be publicly available.
【5】 Topographic VAEs learn Equivariant Capsules 标题:地形VAE学习等变胶囊 链接:https://arxiv.org/abs/2109.01394
作者:T. Anderson Keller,Max Welling 机构:UvA-Bosch Delta Lab, University of Amsterdam 摘要:在这项工作中,我们试图在神经网络中把地形组织与等变这两个概念联系起来。为此,我们引入了地形VAE:一种高效训练潜变量按地形组织的深度生成模型的新方法。我们表明,这样的模型在MNIST上确实学会了按照数字类别、宽度和风格等显著特征来组织其激活。此外,通过随时间的地形组织(即时间相干性),我们展示了如何促使模型针对观察到的变换输入序列形成预定义的潜在空间变换算子——这是一种初级形式的无监督等变学习。我们证明该模型能直接从序列中成功学习一组近似等变的特征(即“胶囊”),并在相应变换的测试序列上获得更高的似然。等变性通过度量推断网络与序列变换之间的近似可交换性得到了定量验证。最后,我们展示了对复杂变换的近似等变性,扩展了现有群等变神经网络的能力。 摘要:In this work we seek to bridge the concepts of topographic organization and equivariance in neural networks. To accomplish this, we introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables. We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST. Furthermore, through topographic organization over time (i.e. temporal coherence), we demonstrate how predefined latent space transformation operators can be encouraged for observed transformed input sequences -- a primitive form of unsupervised learned equivariance. We demonstrate that this model successfully learns sets of approximately equivariant features (i.e. "capsules") directly from sequences and achieves higher likelihood on correspondingly transforming test sequences. Equivariance is verified quantitatively by measuring the approximate commutativity of the inference network and the sequence transformations. Finally, we demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.
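下面是地形化潜变量构造的一个极简示意(单侧环形窗口、窗口大小与数值细节均为本文假设,TVAE原文的地形结构更一般):用局部求和的 v² 归一化 u,使相邻潜变量共享激活强度:

```python
import torch

def topographic_latents(u, v, window=3):
    """u, v: (batch, dim) 的标准高斯样本;t = u / sqrt(W·v²),
    其中 W·v² 为 v² 在相邻坐标上的(环形)滚动求和。"""
    v2 = v.pow(2)
    denom = sum(torch.roll(v2, shifts=s, dims=1) for s in range(window))
    return u / torch.sqrt(denom + 1e-8)

t = topographic_latents(torch.randn(16, 64), torch.randn(16, 64))
print(t.shape)  # (16, 64):相邻维度的幅值相互耦合,形成“地形”
```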
【6】 Access Control Using Spatially Invariant Permutation of Feature Maps for Semantic Segmentation Models 标题:基于特征图空间不变置换的语义分割模型访问控制 链接:https://arxiv.org/abs/2109.01332
作者:Hiroki Ito,MaungMaung AprilPyone,Hitoshi Kiya 机构:Tokyo Metropolitan University, Japan 备注:To appear in 13th Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2021) 摘要:在本文中,我们提出了一种访问控制方法,利用带密钥的特征图空间不变置换来保护语义分割模型。分割模型在训练和测试时都使用密钥对选定的特征图进行置换。所提方法不仅允许持有正确密钥的合法用户获得模型的全部性能,还会使未授权用户得到的性能下降。传统的访问控制方法只关注图像分类任务,从未被应用于语义分割任务。实验表明,受保护的模型既能让合法用户获得与未保护模型几乎相同的性能,又能抵御没有密钥的未授权用户的访问。此外,实验还证实,采用分块变换的传统方法在语义分割模型上会出现性能下降。 摘要:In this paper, we propose an access control method that uses the spatially invariant permutation of feature maps with a secret key for protecting semantic segmentation models. Segmentation models are trained and tested by permuting selected feature maps with a secret key. The proposed method allows rightful users with the correct key not only to access a model to full capacity but also to degrade the performance for unauthorized users. Conventional access control methods have focused only on image classification tasks, and these methods have never been applied to semantic segmentation tasks. In an experiment, the protected models were demonstrated to allow rightful users to obtain almost the same performance as that of non-protected models but also to be robust against access by unauthorized users without a key. In addition, a conventional method with block-wise transformations was also verified to have degraded performance under semantic segmentation models.
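核心操作可以用几行PyTorch示意(通道置换对所有空间位置一致,故称空间不变;“密钥即随机种子”是本文的演示假设):

```python
import torch

def permute_channels(fmap, key, inverse=False):
    """fmap: (batch, C, H, W);由密钥生成固定的通道置换,对所有空间位置一致。"""
    g = torch.Generator().manual_seed(key)
    perm = torch.randperm(fmap.shape[1], generator=g)
    if inverse:
        perm = torch.argsort(perm)  # 逆置换,用于还原
    return fmap[:, perm]

x = torch.randn(2, 64, 32, 32)
y = permute_channels(x, key=12345)
assert torch.equal(permute_channels(y, key=12345, inverse=True), x)  # 正确密钥可还原
# 无密钥用户等价于使用随机的错误置换,模型性能随之下降。
```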
【7】 Estimating Demand Flexibility Using Siamese LSTM Neural Networks 标题:基于孪生LSTM神经网络的需求弹性估计 链接:https://arxiv.org/abs/2109.01258
作者:Guangchun Ruan,Daniel S. Kirschen,Haiwang Zhong,Qing Xia,Chongqing Kang 机构: Department of Electrical Engineering, Tsinghua University, Kirschen is with the Department of Electrical & Computer Engineer-ing, University of Washington 备注:Author copy of the manuscript submitted to IEEE Trans on Power Systems 摘要:在现代电力系统中,可以通过动态电价激励消费者,从而挖掘需求灵活性。在本文中,我们使用一种称为时变弹性的有效工具来量化需求灵活性,其取值可能随价格和决策动态而变化。该工具对于评估需求响应潜力和系统可靠性尤为有用。最近的经验证据表明,研究需求灵活性时存在一些异常特征,例如响应延迟以及价格飙升后弹性消失。现有方法严重依赖某些预先定义(往往过度简化)的回归表达式,因而无法刻画这些复杂特征。为此,本文提出一种无模型方法,自动而准确地导出最优估计模式。我们进一步利用孪生长短期记忆(LSTM)网络构建了两阶段估计过程:一个LSTM网络编码价格响应,另一个网络估计时变弹性。在案例研究中,与最先进的方法相比,所提框架和模型被验证能取得更高的总体估计精度,并能更好地刻画各种异常特征。 摘要:There is an opportunity in modern power systems to explore the demand flexibility by incentivizing consumers with dynamic prices. In this paper, we quantify demand flexibility using an efficient tool called time-varying elasticity, whose value may change depending on the prices and decision dynamics. This tool is particularly useful for evaluating the demand response potential and system reliability. Recent empirical evidences have highlighted some abnormal features when studying demand flexibility, such as delayed responses and vanishing elasticities after price spikes. Existing methods fail to capture these complicated features because they heavily rely on some predefined (often over-simplified) regression expressions. Instead, this paper proposes a model-free methodology to automatically and accurately derive the optimal estimation pattern. We further develop a two-stage estimation process with Siamese long short-term memory (LSTM) networks. Here, an LSTM network encodes the price response, while the other network estimates the time-varying elasticities. In the case study, the proposed framework and models are validated to achieve higher overall estimation accuracy and better description for various abnormal features when compared with the state-of-the-art methods.
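下面给出两阶段结构的一个PyTorch示意(层数、维度与“弹性×价格变化≈需求变化”的线性近似均为演示假设,并非论文模型的精确复现):

```python
import torch
import torch.nn as nn

class SiameseElasticity(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.price_enc = nn.LSTM(1, hidden, batch_first=True)       # 编码价格响应
        self.elas_net = nn.LSTM(hidden + 1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, price):                  # price: (batch, T, 1),相对价格变化
        ctx, _ = self.price_enc(price)         # 捕捉延迟响应等动态
        h, _ = self.elas_net(torch.cat([ctx, price], dim=-1))
        elasticity = self.out(h)               # (batch, T, 1):时变弹性 ε_t
        demand_change = elasticity * price     # 线性弹性近似:Δq_t ≈ ε_t · Δp_t
        return demand_change, elasticity

model = SiameseElasticity()
dq, eps = model(torch.randn(4, 24, 1))        # 以重构需求变化为训练目标,ε_t 即为产出
```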
其他(9篇)
【1】 LightAutoML: AutoML Solution for a Large Financial Services Ecosystem 标题:LightAutoML:面向大型金融服务生态系统的AutoML解决方案 链接:https://arxiv.org/abs/2109.01528
作者:Anton Vakhrushev,Alexander Ryzhkov,Maxim Savchenko,Dmitry Simakov,Rinchin Damdinov,Alexander Tuzhilin 机构:Sber AI Lab, Stern School of Business, NYU 摘要:我们介绍了一个名为LightAutoML的AutoML系统,它是为一家大型欧洲金融服务公司及其生态系统开发的,满足了该生态系统对AutoML解决方案的一系列特殊要求。我们的框架已在众多应用中试点和部署,其表现达到经验丰富的数据科学家的水平,同时构建高质量ML模型的速度明显快于这些数据科学家。我们还将系统的性能与多种通用开源AutoML解决方案进行了比较,结果表明,在大多数生态系统问题和OpenML问题上,它的表现更好。我们还介绍了在开发AutoML系统并将其投入生产的过程中所学到的经验教训。 摘要:We present an AutoML system called LightAutoML developed for a large European financial services company and its ecosystem satisfying the set of idiosyncratic requirements that this ecosystem has for AutoML solutions. Our framework was piloted and deployed in numerous applications and performed at the level of the experienced data scientists while building high-quality ML models significantly faster than these data scientists. We also compare the performance of our system with various general-purpose open source AutoML solutions and show that it performs better for most of the ecosystem and OpenML problems. We also present the lessons that we learned while developing the AutoML system and moving it into production.
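LightAutoML已开源;下面是其表格数据预设的一个典型调用示意(接口可能随版本变化,文件名与列名均为虚构,请以官方文档为准):

```python
import pandas as pd
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

train = pd.read_csv("train.csv")  # 假设含特征列与二分类目标列 "target"
automl = TabularAutoML(task=Task("binary"), timeout=600)  # 600秒预算
oof_pred = automl.fit_predict(train, roles={"target": "target"})  # 折外预测
test_pred = automl.predict(pd.read_csv("test.csv"))
```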
【2】 Semi-Implicit Neural Solver for Time-dependent Partial Differential Equations 标题:含时偏微分方程的半隐式神经网络求解器 链接:https://arxiv.org/abs/2109.01467
作者:Suprosanna Shit,Ivan Ezhov,Leon Mächler,Abinav R.,Jana Lipkova,Johannes C. Paetzold,Florian Kofler,Marie Piraud,Bjoern H. Menze 机构:Department of Informatics, Technical University Munich, Leon M¨achler, D´epartement d’informatique, ´Ecole normale sup´erieure, Paris, deepc GmbH, Munich, Brigham and Women’s Hospital, Harvard Medical School, Helmholtz AI, Helmholtz Zentrum M¨unchen 摘要:含时偏微分方程(PDE)的快速精确求解是物理、工程和生物学等许多研究领域的关键关切。通常,为了提高稳定性和正确性,隐式/半隐式格式优于显式格式。然而,现有的半隐式方法通常是迭代式的,并使用通用求解器,这对特定类别的偏微分方程可能并非最优。在本文中,我们提出了一种神经求解器,以数据驱动的方式为任意一类偏微分方程学习最优迭代格式。具体来说,我们使用深度神经网络修改半隐式求解器的单次迭代。我们为神经求解器的正确性和收敛性提供了与传统迭代求解器类似的理论保证。除了常用的Dirichlet边界条件外,我们还采用扩散域方法来纳入多种类型的边界条件,例如Neumann边界条件。我们证明了所提神经求解器可以超越线性偏微分方程,适用于非线性分量非刚性的一类非线性偏微分方程。我们在二维和三维场景中演示了方法的有效性,展示了模型如何泛化到与训练时不同的参数设置,并且比半隐式格式收敛更快。 摘要:Fast and accurate solutions of time-dependent partial differential equations (PDEs) are of pivotal interest to many research fields, including physics, engineering, and biology. Generally, implicit/semi-implicit schemes are preferred over explicit ones to improve stability and correctness. However, existing semi-implicit methods are usually iterative and employ a general-purpose solver, which may be sub-optimal for a specific class of PDEs. In this paper, we propose a neural solver to learn an optimal iterative scheme in a data-driven fashion for any class of PDEs. Specifically, we modify a single iteration of a semi-implicit solver using a deep neural network. We provide theoretical guarantees for the correctness and convergence of neural solvers analogous to conventional iterative solvers. In addition to the commonly used Dirichlet boundary condition, we adopt a diffuse domain approach to incorporate a diverse type of boundary conditions, e.g., Neumann. We show that the proposed neural solver can go beyond linear PDEs and applies to a class of non-linear PDEs, where the non-linear component is non-stiff. We demonstrate the efficacy of our method on 2D and 3D scenarios. To this end, we show how our model generalizes to parameter settings, which are different from training; and achieves faster convergence than semi-implicit schemes.
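下面以一维热方程为例给出一个示意:在半隐式定点迭代的每一步叠加一个可学习的修正网络,对应“用深度网络修改半隐式求解器的单次迭代”的思路(网络结构与所有参数均为本文假设):

```python
import torch
import torch.nn as nn

correction = nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False)  # 可学习修正项

def neural_iteration(u, u_prev, dt, dx, nu=0.1, inner_steps=5):
    """对隐式步 u = u_prev + dt·ν·Δu 做定点迭代,每次迭代叠加网络修正。
    周期边界,纯演示用途。"""
    lap = lambda v: (torch.roll(v, 1, -1) - 2 * v + torch.roll(v, -1, -1)) / dx ** 2
    for _ in range(inner_steps):
        u = u_prev + dt * nu * lap(u)                    # 经典半隐式定点迭代
        u = u + correction(u.unsqueeze(1)).squeeze(1)    # 数据驱动的迭代修正
    return u

u0 = torch.sin(torch.linspace(0, 6.283, 64)).unsqueeze(0)
u1 = neural_iteration(u0.clone(), u0, dt=1e-3, dx=0.1)   # 推进一个时间步
```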
【3】 A New Approach to Multilabel Stratified Cross Validation with Application to Large and Sparse Gene Ontology Datasets 标题:一种新的多标签分层交叉验证方法及其在大型稀疏基因本体数据集上的应用 链接:https://arxiv.org/abs/2109.01425
作者:Henri Tiittanen,Liisa Holm,Petri Törönen 机构:Institute of Biotechnology, Helsinki Institute of Life Sciences, (HiLife), University of Helsinki, Helsinki, Finland, Organismal and Evotionary Biology Research Program, Biosciences, University of Helsinki 备注:12 pages, 1 figure 摘要:多标签学习是机器学习研究中的一个重要课题。在多标签设定下评估模型需要专为多标签数据设计的交叉验证方法。在本文中,我们指出了文献中广泛使用的一个评估指标的弱点,并提出了该指标的改进版本以及优化交叉验证划分的通用方法optisplit。我们对各种类型的交叉验证方法进行了广泛比较,结果表明optisplit产生的交叉验证划分优于现有方法,并且速度足够快,可用于大型基因本体(GO)数据集。 摘要:Multilabel learning is an important topic in machine learning research. Evaluating models in multilabel settings requires specific cross validation methods designed for multilabel data. In this article, we show a weakness in an evaluation metric widely used in literature and we present improved versions of this metric and a general method, optisplit, for optimising cross validation splits. We present an extensive comparison of various types of cross validation methods in which we show that optisplit produces better cross validation splits than the existing methods and that it is fast enough to be used on big Gene Ontology (GO) datasets.
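下面是一个贪心式多标签分层划分的最小示意(这只是常见的贪心近似,并非optisplit算法本身;评分函数为本文假设):

```python
import numpy as np

def greedy_multilabel_split(Y, n_folds=5, seed=0):
    """Y: (n, L) 的0/1标签矩阵;逐样本放入“最缺其标签”的折,
    并用折大小项保持各折规模均衡。"""
    rng = np.random.default_rng(seed)
    fold_counts = np.zeros((n_folds, Y.shape[1]))
    fold_sizes = np.zeros(n_folds)
    assignment = np.empty(len(Y), dtype=int)
    for i in rng.permutation(len(Y)):
        # 对样本 i 的每个正标签,偏好该标签当前计数最少的折
        need = (Y[i] * (fold_counts.max(0) - fold_counts)).sum(1) - 0.01 * fold_sizes
        f = int(np.argmax(need))
        assignment[i] = f
        fold_counts[f] += Y[i]
        fold_sizes[f] += 1
    return assignment

Y = (np.random.default_rng(1).random((1000, 20)) < 0.05).astype(int)  # 稀疏多标签
folds = greedy_multilabel_split(Y)
print(np.bincount(folds))  # 各折大小应大致均衡,稀有标签也被摊开
```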
【4】 Instance-wise or Class-wise? A Tale of Neighbor Shapley for Concept-based Explanation 标题:实例级还是类级?基于概念解释的邻居沙普利(Shapley)故事 链接:https://arxiv.org/abs/2109.01369
作者:Jiahui Li,Kun Kuang,Lin Li,Long Chen,Songyang Zhang,Jian Shao,Jun Xiao 机构: Zhejiang University,Columbia University,University of Rochester 摘要:深度神经网络在许多数据驱动和面向预测的应用中表现出了卓越的性能,有时甚至比人类表现得更好。然而,它们最显著的缺点是缺乏可解释性,这使得它们在许多实际应用中没有吸引力。当涉及道德问题或不确定的环境因素(如犯罪判断、财务分析和医疗诊断)时,挖掘模型预测(解释模型知识)的证据以说服人类是至关重要的。因此,研究如何解释模型知识对于学术研究和实际应用都至关重要。 摘要:Deep neural networks have demonstrated remarkable performance in many data-driven and prediction-oriented applications, and sometimes even perform better than humans. However, their most significant drawback is the lack of interpretability, which makes them less attractive in many real-world applications. When relating to the moral problem or the environmental factors that are uncertain such as crime judgment, financial analysis, and medical diagnosis, it is essential to mine the evidence for the model's prediction (interpret model knowledge) to convince humans. Thus, investigating how to interpret model knowledge is of paramount importance for both academic research and real applications.
【5】 How to Inject Backdoors with Better Consistency: Logit Anchoring on Clean Data 标题:如何以更好的一致性注入后门:基于干净数据的Logit锚定 链接:https://arxiv.org/abs/2109.01300
作者:Zhiyuan Zhang,Lingjuan Lyu,Weiqiang Wang,Lichao Sun,Xu Sun 摘要:由于从头训练大规模后门模型需要大量训练数据,最近的若干攻击都考虑在不改变干净数据上模型行为的情况下,将后门注入已训练好的干净模型中。以前的工作发现,可以通过对抗性权重扰动(Adversarial Weight Perturbation,AWP)将后门注入训练好的干净模型;这里,AWP指后门学习中幅度很小的参数变化。在这项工作中,我们观察到一个有趣的现象:在调整已训练的干净模型以注入后门时,参数的变化总是AWP。我们进一步给出理论分析来解释这一现象。我们将在干净数据上保持准确率的行为形式化为后门模型的一致性,包括全局一致性和逐实例一致性,并深入分析了AWP对后门模型一致性的影响。为了获得更好的一致性,我们提出了一种新的锚定损失,用于锚定或冻结模型在干净数据上的行为,并给出了理论保证。分析结果和实证结果都验证了锚定损失在提高一致性、尤其是逐实例一致性方面的有效性。 摘要:Since training a large-scale backdoored model from scratch requires a large training dataset, several recent attacks have considered to inject backdoors into a trained clean model without altering model behaviors on the clean data. Previous work finds that backdoors can be injected into a trained clean model with Adversarial Weight Perturbation (AWP). Here AWPs refers to the variations of parameters that are small in backdoor learning. In this work, we observe an interesting phenomenon that the variations of parameters are always AWPs when tuning the trained clean model to inject backdoors. We further provide theoretical analysis to explain this phenomenon. We formulate the behavior of maintaining accuracy on clean data as the consistency of backdoored models, which includes both global consistency and instance-wise consistency. We extensively analyze the effects of AWPs on the consistency of backdoored models. In order to achieve better consistency, we propose a novel anchoring loss to anchor or freeze the model behaviors on the clean data, with a theoretical guarantee. Both the analytical and the empirical results validate the effectiveness of the anchoring loss in improving the consistency, especially the instance-wise consistency.
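锚定损失的形式可以用几行PyTorch示意(记号与权重λ为本文假设):后门样本上的任务损失,加上干净样本上与原模型logit对齐的锚定项:

```python
import torch
import torch.nn.functional as F

def anchoring_backdoor_loss(model, frozen_model, clean_x, poison_x, poison_y, lam=10.0):
    """model: 正在被植入后门的模型;frozen_model: 原干净模型(参数冻结)。"""
    backdoor_loss = F.cross_entropy(model(poison_x), poison_y)  # 学习触发器→目标类
    with torch.no_grad():
        anchor_logits = frozen_model(clean_x)                   # 原模型在干净数据上的logit
    anchor_loss = F.mse_loss(model(clean_x), anchor_logits)     # 锚定(冻结)干净行为
    return backdoor_loss + lam * anchor_loss                    # λ 权衡一致性与攻击成功率
```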
【6】 So Cloze yet so Far: N400 Amplitude is Better Predicted by Distributional Information than Human Predictability Judgements 标题:如此Cloze却又如此遥远:分布信息比人类可预测性判断更能预测N400振幅 链接:https://arxiv.org/abs/2109.01226
作者:James A. Michaelov,Seana Coulson,Benjamin K. Bergen 机构: Bergen are with the Department ofCognitive Science, University of California San Diego 备注:Submitted 摘要:更易预测的单词更容易处理——它们的阅读速度更快,并引发与处理困难相关的更小的神经信号,最显著的是事件相关脑电位的N400成分。因此,有人认为,预测即将出现的单词是语言理解的一个关键组成部分,而研究N400的振幅是研究我们所做预测的一个有价值的方法。在这项研究中,我们调查了计算语言模型或人类的语言预测是否更好地反映了自然语言刺激调节N400振幅的方式。人类语言预测与计算语言模型的一个重要区别是,虽然语言模型的预测完全基于前面的语言背景,但人类可能依赖其他因素。我们发现三种顶尖的当代语言模型——GPT-3、RoBERTa和ALBERT——的预测比人类的预测更接近N400。这表明N400背后的预测过程可能比以前认为的对语言表层统计更敏感。 摘要:More predictable words are easier to process - they are read faster and elicit smaller neural signals associated with processing difficulty, most notably, the N400 component of the event-related brain potential. Thus, it has been argued that prediction of upcoming words is a key component of language comprehension, and that studying the amplitude of the N400 is a valuable way to investigate the predictions that we make. In this study, we investigate whether the linguistic predictions of computational language models or humans better reflect the way in which natural language stimuli modulate the amplitude of the N400. One important difference in the linguistic predictions of humans versus computational language models is that while language models base their predictions exclusively on the preceding linguistic context, humans may rely on other factors. We find that the predictions of three top-of-the-line contemporary language models - GPT-3, RoBERTa, and ALBERT - match the N400 more closely than human predictions. This suggests that the predictive processes underlying the N400 may be more sensitive to the surface-level statistics of language than previously thought.
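此类研究通常比较语言模型给出的目标词意外度(surprisal,负对数概率)与N400振幅;下面用开源的GPT-2(论文实际使用GPT-3、RoBERTa与ALBERT)演示如何计算意外度:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def surprisal(context, target):
    """目标词各 token 的负对数概率之和(由前一位置的预测分布给出)。
    假设 context 的分词是 context+target 分词的前缀,通常对GPT-2的BPE成立。"""
    ids = tok(context + " " + target, return_tensors="pt").input_ids
    n_ctx = tok(context, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logprobs = lm(ids).logits.log_softmax(-1)
    return -sum(logprobs[0, i - 1, ids[0, i]].item()
                for i in range(n_ctx, ids.shape[1]))

print(surprisal("He took a sip of his hot", "coffee"))  # 低意外度,对应较小的N400
print(surprisal("He took a sip of his hot", "shoe"))    # 高意外度,对应较大的N400
```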
【7】 Frequency-Severity Experience Rating based on Latent Markovian Risk Profiles 标题:基于潜在马尔可夫风险画像的频率-严重度经验评级 链接:https://arxiv.org/abs/2109.01413
作者:Robert Matthijs Verschuren 机构:∗, Amsterdam School of Economics, University of Amsterdam, . 摘要:奖惩系统(Bonus-Malus System)传统上只考虑客户的索赔次数而不考虑索赔金额,尽管二者在实践中是相关的。我们提出了一种基于潜在马尔可夫风险画像的新型联合经验评级方法,以容许个体层面频率与严重度之间的正或负相关。潜在画像在隐马尔可夫模型中随时间演化,以捕获客户索赔经验的更新,并使索赔次数与金额条件独立。我们表明,由此得到的风险保费是标准可信度保费的一种动态的、按索赔经验加权的混合。所提方法被应用于一个荷兰汽车保险组合,识别出具有鲜明索赔行为的客户风险画像。这些画像进而使我们能够更好地区分客户风险。 摘要:Bonus-Malus Systems traditionally consider a customer's number of claims irrespective of their sizes, even though these components are dependent in practice. We propose a novel joint experience rating approach based on latent Markovian risk profiles to allow for a positive or negative individual frequency-severity dependence. The latent profiles evolve over time in a Hidden Markov Model to capture updates in a customer's claims experience, making claim counts and sizes conditionally independent. We show that the resulting risk premia lead to a dynamic, claims experience-weighted mixture of standard credibility premia. The proposed approach is applied to a Dutch automobile insurance portfolio and identifies customer risk profiles with distinctive claiming behavior. These profiles, in turn, enable us to better distinguish between customer risks.
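下面用NumPy/SciPy给出潜在风险画像前向更新的一个玩具示意:频率取泊松、严重度取伽马,两档画像与全部参数均为虚构,仅演示“索赔经验更新画像后验”的机制:

```python
import numpy as np
from scipy.stats import poisson, gamma

P = np.array([[0.9, 0.1], [0.2, 0.8]])  # 画像转移矩阵(低风险/高风险)
freq_mu = np.array([0.05, 0.4])          # 各画像的年索赔频率
sev_shape, sev_scale = 2.0, np.array([500.0, 2000.0])  # 各画像的严重度参数

def update_profile(prior, n_claims, claim_sizes):
    """隐马尔可夫前向更新:先按转移矩阵演化一年,再用当年索赔经验修正。
    给定画像,索赔次数与各笔金额条件独立。"""
    like = poisson.pmf(n_claims, freq_mu)
    for s in claim_sizes:
        like = like * gamma.pdf(s, sev_shape, scale=sev_scale)
    post = prior @ P * like
    return post / post.sum()

belief = np.array([0.8, 0.2])
belief = update_profile(belief, n_claims=2, claim_sizes=[3000.0, 5200.0])
print(belief)  # 高风险画像的后验概率应上升,风险保费随之动态调整
```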
【8】 Two Shifts for Crop Mapping: Leveraging Aggregate Crop Statistics to Improve Satellite-based Maps in New Regions 标题:作物制图的两个转变:利用综合作物统计数据改进新区域的卫星地图 链接:https://arxiv.org/abs/2109.01246
作者:Dan M. Kluger,Sherrie Wang,David B. Lobell 机构:Department of Statistics, Stanford University; Department of Earth System Science and Center on Food Security and the Environment, Stanford University 备注:None 摘要:农田一级的作物类型制图对于农业监测的各种应用至关重要,卫星图像正成为制作作物类型地图的日益丰富和有用的原始输入。尽管如此,在许多地区,利用卫星数据绘制作物类型图仍然受到缺乏用于训练监督分类模型的田间作物标签的限制。当一个区域中没有可用的训练数据时,可以迁移在相似区域训练的分类器,但作物类型分布的变化以及区域之间特征的变换会导致分类精度降低。我们提出了一种方法,通过考虑这两种类型的偏移,使用聚合级作物统计来校正分类器。为了调整作物类型组成的变化,我们提出了一种方案,用于适当地重新加权分类器输出的每个类别的后验概率。为了调整特征中的偏移,我们提出了一种估计和去除平均特征向量中线性偏移的方法。我们证明,当使用线性判别分析(LDA)绘制法国奥克西塔尼大区(Occitanie)和肯尼亚西部省的作物类型时,该方法可显著提高总体分类精度。当使用LDA作为我们的基础分类器时,我们发现在法国,我们的方法使11个不同的训练省份(department)的误分类率降低了2.8%至42.2%(平均值=21.9%),在肯尼亚,三个训练区域的误分类率分别降低了6.6%、28.4%和42.7%。虽然我们的方法在统计上是由LDA分类器驱动的,但它可以应用于任何类型的分类器。作为一个例子,我们展示了它在改进随机森林分类器中的成功应用。 摘要:Crop type mapping at the field level is critical for a variety of applications in agricultural monitoring, and satellite imagery is becoming an increasingly abundant and useful raw input from which to create crop type maps. Still, in many regions crop type mapping with satellite data remains constrained by a scarcity of field-level crop labels for training supervised classification models. When training data is not available in one region, classifiers trained in similar regions can be transferred, but shifts in the distribution of crop types as well as transformations of the features between regions lead to reduced classification accuracy. We present a methodology that uses aggregate-level crop statistics to correct the classifier by accounting for these two types of shifts. To adjust for shifts in the crop type composition we present a scheme for properly reweighting the posterior probabilities of each class that are output by the classifier. To adjust for shifts in features we propose a method to estimate and remove linear shifts in the mean feature vector. We demonstrate that this methodology leads to substantial improvements in overall classification accuracy when using Linear Discriminant Analysis (LDA) to map crop types in Occitanie, France and in Western Province, Kenya. When using LDA as our base classifier, we found that in France our methodology led to percent reductions in misclassifications ranging from 2.8% to 42.2% (mean = 21.9%) over eleven different training departments, and in Kenya the percent reductions in misclassification were 6.6%, 28.4%, and 42.7% for three training regions. While our methodology was statistically motivated by the LDA classifier, it can be applied to any type of classifier. As an example, we demonstrate its successful application to improve a Random Forest classifier.
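论文的两类校正都可以直接写成几行代码;下面是一个NumPy示意(类先验与后验数值均为虚构):

```python
import numpy as np

def reweight_posteriors(probs, source_prior, target_prior):
    """probs: (n, K) 源区域分类器的后验;按 q(y)/p(y) 缩放后重新归一化,
    校正作物类型组成(标签先验)的偏移。"""
    adjusted = probs * (target_prior / source_prior)
    return adjusted / adjusted.sum(axis=1, keepdims=True)

def remove_mean_shift(X_target, source_mean):
    """将目标区域特征整体平移,使其均值与源区域对齐(线性偏移假设)。"""
    return X_target - (X_target.mean(axis=0) - source_mean)

probs = np.array([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2]])
p_src = np.array([0.5, 0.3, 0.2])  # 源区域作物比例
q_tgt = np.array([0.2, 0.3, 0.5])  # 目标区域汇总统计给出的比例
print(reweight_posteriors(probs, p_src, q_tgt))  # 偏向目标区域更常见的类别
```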
【9】 Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development 标题:面向高质量大语音数据集开发的可扩展数据标注流水线 链接:https://arxiv.org/abs/2109.01164
作者:Mingkuan Liu,Chi Zhang,Hua Xing,Chao Feng,Monchu Chen,Judith Bishop,Grace Ngapo 机构:Appen 备注:Submitted to NeurIPS 2021 Datasets and Benchmarks Track (Round 2) 摘要:本文介绍了一种人在回路(HITL)数据标注流水线,用于生成高质量、大规模的语音数据集。该流水线结合人与机器的优势,通过机器预标注和全量人工审核,更快速、准确且经济高效地标注数据集。流水线中采用了盲测、行为监控和数据校验等质量控制机制,以减轻机器生成标签带来的潜在偏差。我们的A/B测试和试点结果表明,HITL流水线可将标注速度和产能提高至少80%,质量与人工双遍标注相当或更高。我们正利用这一可扩展的流水线创建并持续扩充多语言的超大容量现货(UHV-OTS)语音语料库,每种语言每年可扩展至10000小时以上。定制数据集可通过动态打包从UHV-OTS语料库生成。UHV-OTS是Appen的一个长期项目,用于支持语音处理领域的商业与学术研究数据需求。Appen每年将从UHV-OTS中捐赠若干免费语音数据集,以支持CC-BY-SA许可下的学术和开源社区研究。我们还在Apache 2.0许可下发布了数据预处理和预标注流水线的代码,以便复现论文中报告的结果。 摘要:This paper introduces a human-in-the-loop (HITL) data annotation pipeline to generate high-quality, large-scale speech datasets. The pipeline combines human and machine advantages to more quickly, accurately, and cost-effectively annotate datasets with machine pre-labeling and fully manual auditing. Quality control mechanisms such as blind testing, behavior monitoring, and data validation have been adopted in the annotation pipeline to mitigate potential bias introduced by machine-generated labels. Our A/B testing and pilot results demonstrated the HITL pipeline can improve annotation speed and capacity by at least 80% and quality is comparable to or higher than manual double pass annotation. We are leveraging this scalable pipeline to create and continuously grow ultra-high volume off-the-shelf (UHV-OTS) speech corpora for multiple languages, with the capability to expand to 10,000 hours per language annually. Customized datasets can be produced from the UHV-OTS corpora using dynamic packaging. UHV-OTS is a long-term Appen project to support commercial and academic research data needs in speech processing. Appen will donate a number of free speech datasets from the UHV-OTS each year to support academic and open source community research under the CC-BY-SA license. We are also releasing the code of the data pre-processing and pre-tagging pipeline under the Apache 2.0 license to allow reproduction of the results reported in the paper.
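机器预标注与人工审核的路由逻辑可以用如下示意表达(置信度阈值、抽检比例与 asr_model 接口均为本文假设,论文未给出具体实现):

```python
import random

def route(samples, asr_model, conf_threshold=0.9, audit_rate=0.1):
    """低置信度的机器预标注样本送人工全检;
    高置信度样本按比例抽检(盲测),用于监控机器标签的潜在偏差。"""
    to_human, auto_accepted = [], []
    for s in samples:
        text, conf = asr_model(s)  # 机器预标注及其置信度(假设的接口)
        if conf < conf_threshold or random.random() < audit_rate:
            to_human.append((s, text))
        else:
            auto_accepted.append((s, text))
    return to_human, auto_accepted
```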
机器翻译,仅供参考