Statistics Academic Digest [6.18]

2021-07-02 19:03:56

Visit www.arxivdaily.com for digests with abstracts, covering CS, Physics, Math, Economics, Statistics, Finance, Biology, and Electrical Engineering, plus search, bookmarking, and posting features!

stat (Statistics): 48 papers in total

【1】 Spectral goodness-of-fit tests for complete and partial network data

Authors: Shane Lubold, Bolun Liu, Tyler H. McCormick
Funding note: Research reported in this publication was supported by the National Institute of Mental Health of the National Institutes of Health under Award Number DP2MH122405
Link: https://arxiv.org/abs/2106.09702
Abstract: Networks describe the, often complex, relationships between individual actors. In this work, we address the question of how to determine whether a parametric model, such as a stochastic block model or latent space model, fits a dataset well and will extrapolate to similar data. We use recent results in random matrix theory to derive a general goodness-of-fit test for dyadic data. We show that our method, when applied to a specific model of interest, provides a straightforward, computationally fast way of selecting parameters in a number of commonly used network models. For example, we show how to select the dimension of the latent space in latent space models. Unlike other network goodness-of-fit methods, our general approach does not require simulating from a candidate parametric model, which can be cumbersome with large graphs, and eliminates the need to choose a particular set of statistics on the graph for comparison. It also allows us to perform goodness-of-fit tests on partial network data, such as Aggregated Relational Data. We show with simulations that our method performs well in many situations of interest. We analyze several empirically relevant networks and show that our method leads to improved community detection algorithms. R code to implement our method is available on Github.
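The abstract does not spell out the test statistic, but the general recipe in this line of spectral goodness-of-fit work (standardize the residual adjacency matrix under the fitted model and compare its largest eigenvalue to its Tracy-Widom limit) can be sketched as follows. The constant-density fit and the Erdos-Renyi toy example are illustrative, not the paper's exact construction.

```python
import numpy as np

def spectral_gof_stat(A, P_hat):
    """Largest eigenvalue of the standardized residual matrix (A - P_hat).

    Sketch of a spectral goodness-of-fit statistic: under a well-fitting
    model, the centered/scaled extreme eigenvalue is approximately
    Tracy-Widom distributed.
    """
    n = A.shape[0]
    denom = np.sqrt((n - 1) * P_hat * (1.0 - P_hat))
    R = (A - P_hat) / denom
    np.fill_diagonal(R, 0.0)
    lam_max = np.linalg.eigvalsh(R).max()
    # Center and scale so the null limit is an order-1 Tracy-Widom law.
    return n ** (2 / 3) * (lam_max - 2.0)

# Toy usage: Erdos-Renyi graph tested against its own fitted density.
rng = np.random.default_rng(0)
n, p = 300, 0.1
A = rng.binomial(1, p, size=(n, n))
A = np.triu(A, 1)
A = A + A.T
P_hat = np.full((n, n), A.sum() / (n * (n - 1)))
print(spectral_gof_stat(A, P_hat))  # compare to a Tracy-Widom quantile
```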

【2】 Optimal Relevant Subset Designs in Nonlinear Models

Authors: Adam Lane
Note: 25 pages, 6 figures, 1 table
Link: https://arxiv.org/abs/2106.09633
Abstract: Fisher (1934) argued that certain ancillary statistics form a relevant subset, a subset of the sample space on which inference should be restricted, and showed that conditioning on their observed value reduces the dimension of the data without a loss of information. The use of ancillary statistics in post-data inference has received significant attention; however, their role in the design of the experiment has not been well characterized. Ancillary statistics are unknown prior to data collection and as a result cannot be incorporated into the design a priori. However, if the data are observed sequentially then the ancillary statistics based on the data from the preceding observations can be used to determine the design assignment for the current observation. The main results of this work describe the benefits of incorporating ancillary statistics, specifically, the ancillary statistic that constitutes a relevant subset, into an adaptive design.

【3】 Large-Scale Multiple Testing for Matrix-Valued Data under Double Dependency

Authors: Xu Han, Sanat Sarkar, Shiyu Zhang
Link: https://arxiv.org/abs/2106.09632
Abstract: High-dimensional inference based on matrix-valued data has drawn increasing attention in modern statistical research, yet not much progress has been made in large-scale multiple testing specifically designed for analysing such data sets. Motivated by this, we consider in this article an electroencephalography (EEG) experiment that produces matrix-valued data and present a scope of developing novel matrix-valued data based multiple testing methods controlling false discoveries for hypotheses that are of importance in such an experiment. The row-column cross-dependency of observations appearing in a matrix form, referred to as double-dependency, is one of the main challenges in the development of such methods. We address it by assuming matrix normal distribution for the observations at each of the independent matrix data-points. This allows us to fully capture the underlying double-dependency informed through the row- and column-covariance matrices and develop methods that are potentially more powerful than the corresponding one (e.g., Fan and Han (2017)) obtained by vectorizing each data point and thus ignoring the double-dependency. We propose two methods to approximate the false discovery proportion with statistical accuracy. While one of these methods is a general approach under double-dependency, the other one provides more computational efficiency for higher dimensionality. Extensive numerical studies illustrate the superior performance of the proposed methods over the principal factor approximation method of Fan and Han (2017). The proposed methods have been further applied to the aforementioned EEG data.
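The "double-dependency" the abstract refers to is exactly the Kronecker structure of the matrix normal distribution: the covariance of the flattened matrix factorizes into a row covariance and a column covariance, which vectorizing and estimating an unstructured covariance would ignore. A small numerical check with illustrative covariances U and V (not EEG-derived):

```python
import numpy as np

# If X (r x c) is matrix-normal with row covariance U and column covariance
# V, the covariance of its row-major flattening (numpy's reshape) is
# kron(U, V): an (r*c x r*c) matrix described by only r^2 + c^2 parameters.
rng = np.random.default_rng(1)
r, c = 4, 3
U = np.eye(r) + 0.5 * np.ones((r, r))                               # rows
V = np.array([[1.0, 0.6, 0.3], [0.6, 1.0, 0.6], [0.3, 0.6, 1.0]])  # columns

L_U, L_V = np.linalg.cholesky(U), np.linalg.cholesky(V)
samples = np.stack([L_U @ rng.standard_normal((r, c)) @ L_V.T
                    for _ in range(100_000)])
emp_cov = np.cov(samples.reshape(len(samples), -1).T)
print(np.abs(emp_cov - np.kron(U, V)).max())  # small: Kronecker structure
```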

【4】 Disentangling Identifiable Features from Noisy Data with Structured Nonlinear ICA

Authors: Hermanni Hälvä, Sylvain Le Corff, Luc Lehéricy, Jonathan So, Yongjie Zhu, Elisabeth Gassiat, Aapo Hyvärinen
Affiliations: Department of Computer Science, University of Helsinki; Samovar, Télécom SudParis, département CITI, Institut Polytechnique de Paris, Palaiseau, France; Laboratoire J. A. Dieudonné, Université Côte d'Azur, CNRS, Nice, France
Note: preprint
Link: https://arxiv.org/abs/2106.09620
Abstract: We introduce a new general identifiable framework for principled disentanglement referred to as Structured Nonlinear Independent Component Analysis (SNICA). Our contribution is to extend the identifiability theory of deep generative models for a very broad class of structured models. While previous works have shown identifiability for specific classes of time-series models, our theorems extend this to more general temporal structures as well as to models with more complex structures such as spatial dependencies. In particular, we establish the major result that identifiability for this framework holds even in the presence of noise of unknown distribution. The SNICA setting therefore subsumes all the existing nonlinear ICA models for time-series and also allows for new, much richer identifiable models. Finally, as an example of our framework's flexibility, we introduce the first nonlinear ICA model for time-series that combines the following very useful properties: it accounts for both nonstationarity and autocorrelation in a fully unsupervised setting; performs dimensionality reduction; models hidden states; and enables principled estimation and inference by variational maximum-likelihood.

【5】 Hierarchical surrogate-based Approximate Bayesian Computation for an electric motor test bench

Authors: David N. John, Livia Stohrer, Claudia Schillings, Michael Schick, Vincent Heuveline
Affiliations: Bosch Research, Renningen, Germany; Engineering Mathematics and Computing Lab (EMCL), Interdisciplinary Center for Scientific Computing (IWR), Heidelberg University, Germany; Mathematical Optimization Group, Institute of Mathematics, University of Mannheim
Link: https://arxiv.org/abs/2106.09597
Abstract: Inferring parameter distributions of complex industrial systems from noisy time series data requires methods to deal with the uncertainty of the underlying data and the used simulation model. Bayesian inference is well suited for these uncertain inverse problems. Standard methods used to identify uncertain parameters are Markov Chain Monte Carlo (MCMC) methods with explicit evaluation of a likelihood function. However, if the likelihood is very complex, such that its evaluation is computationally expensive, or even unknown in its explicit form, Approximate Bayesian Computation (ABC) methods provide a promising alternative. In this work, both methods are first applied to artificially generated data and then to a real-world problem, using data from an electric motor test bench. We show that both methods are able to infer the distribution of varying parameters with a Bayesian hierarchical approach. However, the proposed ABC method is computationally much more efficient at achieving results of similar accuracy. We suggest using summary statistics to reduce the dimension of the data, which significantly increases the efficiency of the algorithm. Further, the simulation model is replaced by a Polynomial Chaos Expansion (PCE) surrogate to speed up model evaluations. We prove consistency for the proposed surrogate-based ABC method with summary statistics under mild conditions on the (approximated) forward model.
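The ABC-with-summary-statistics idea reduces, in its simplest rejection form, to the loop below. The AR(1) simulator stands in for the (PCE-surrogate) motor model, and the summaries and prior are illustrative, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulator(theta, n=200):
    """Toy stand-in for the (surrogate) forward model: an AR(1) series."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = theta * x[t - 1] + rng.standard_normal()
    return x

def summary(x):
    # Low-dimensional summary statistics in place of the full time series.
    return np.array([x.mean(), x.std(), np.corrcoef(x[:-1], x[1:])[0, 1]])

# Rejection ABC: keep prior draws whose summaries land near the data's.
s_obs = summary(simulator(0.7))
thetas = rng.uniform(-1, 1, size=20_000)            # prior draws
dists = np.array([np.linalg.norm(summary(simulator(th)) - s_obs)
                  for th in thetas])
eps = np.quantile(dists, 0.01)                      # acceptance threshold
posterior = thetas[dists <= eps]
print(posterior.mean(), posterior.std())            # concentrates near 0.7
```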

【6】 Estimating spatially-varying density and time-varying demographics with open population spatial capture-recapture: a photo-ID case study on bottlenose dolphins in Barataria Bay, Louisiana, USA

Authors: Richard Glennie, Len Thomas, Todd Speakman, Lance Garrison, Ryan Takeshita, Lori Schwacke
Affiliations: Centre for Research into Ecological and Environmental Modelling, University of St Andrews, St Andrews KY16 9LZ, UK; National Marine Mammal Foundation, San Diego, CA, USA; National Marine Fisheries Service, Southeast Fisheries Science Center, Miami, FL, USA
Note: 34 pages
Link: https://arxiv.org/abs/2106.09579
Abstract:
1. From long-term, spatial capture-recapture (SCR) surveys we infer a population's dynamics over time and distribution over space. It is becoming more computationally feasible to fit these open population SCR (openSCR) models to large datasets and include complex model components, e.g., spatially-varying density surfaces and time-varying population dynamics. Yet, there is limited knowledge on how these methods perform.
2. As a case study, we analyze a multi-year, photo-ID survey on bottlenose dolphins (Tursiops truncatus) in Barataria Bay, Louisiana, USA. This population has been monitored due to the impacts of the nearby Deepwater Horizon oil spill in 2010. Over 2000 capture histories have been collected between 2010 and 2019. Our aim is to identify the challenges in applying openSCR methods to real data and to describe a workflow for other analysts using these methods.
3. We show that inference on survival, recruitment, and density over time since the oil spill provides insight into increased mortality after the spill, possible redistribution of the population thereafter, and continued population decline. Issues in the application are highlighted throughout: possible model misspecification, sensitivity of parameters to model selection, and difficulty in interpreting results due to model assumptions and irregular surveying in time and space. For each issue, we present practical solutions including assessing goodness-of-fit, model-averaging, and clarifying the difference between quantitative results and their qualitative interpretation.
4. Overall, this case study serves as a practical template other analysts can follow and extend; it also highlights the need for further research on the applicability of these methods as we demand richer inference from them.

【7】 A Deep Reinforcement Learning Approach towards Pendulum Swing-up Problem based on TF-Agents

Authors: Yifei Bi, Xinyi Chen, Caihui Xiao
Affiliations: Department of Statistics, Columbia University in the City of New York, New York, USA
Link: https://arxiv.org/abs/2106.09556
Abstract: Adapting the idea of training CartPole with a Deep Q-learning agent, we are able to find a promising result that prevents the pole from falling down. The capacity of reinforcement learning (RL) to learn from the interaction between the environment and agent provides an optimal control strategy. In this paper, we aim to solve the classic pendulum swing-up problem: making the learned pendulum reach the upright position and stay balanced. The Deep Deterministic Policy Gradient algorithm is introduced to operate over the continuous action domain in this problem. Salient results for the optimal pendulum are demonstrated by an increasing average return, decreasing loss, and a live video in the code part.

【8】 Satellite conjunction assessment: Statistical space oddity?

Authors: Soumaya Elkantassi, Anthony Davison
Affiliations: Institute of Mathematics
Note: 26 pages, 3 figures
Link: https://arxiv.org/abs/2106.09541
Abstract: Satellite conjunctions involving "near misses" of space objects are becoming increasingly likely. One approach to risk analysis for them involves the computation of the collision probability, but this has been regarded as having some counter-intuitive properties, and its meaning as a probability is unclear. We formulate an alternative approach based on a simple statistical model that allows highly accurate inference on the miss distance between the two objects, show that this provides a close approximation to a default Bayesian approach, illustrate the method with a case study, and give Monte Carlo results to show its excellent performance.

【9】 Machine learning methods for postprocessing ensemble forecasts of wind gusts: A systematic comparison

Authors: Benedikt Schulz, Sebastian Lerch
Affiliations: Karlsruhe Institute of Technology; Heidelberg Institute for Theoretical Studies
Link: https://arxiv.org/abs/2106.09512
Abstract: Postprocessing ensemble weather predictions to correct systematic errors has become a standard practice in research and operations. However, only a few recent studies have focused on ensemble postprocessing of wind gust forecasts, despite its importance for severe weather warnings. Here, we provide a comprehensive review and systematic comparison of eight statistical and machine learning methods for probabilistic wind gust forecasting via ensemble postprocessing, which can be divided into three groups: state-of-the-art postprocessing techniques from statistics (ensemble model output statistics (EMOS), member-by-member postprocessing, isotonic distributional regression), established machine learning methods (gradient-boosting extended EMOS, quantile regression forests) and neural network-based approaches (distributional regression network, Bernstein quantile network, histogram estimation network). The methods are systematically compared using six years of data from a high-resolution, convection-permitting ensemble prediction system that was run operationally at the German weather service, and hourly observations at 175 surface weather stations in Germany. While all postprocessing methods yield calibrated forecasts and are able to correct the systematic errors of the raw ensemble predictions, incorporating information from additional meteorological predictor variables beyond wind gusts leads to significant improvements in forecast skill. In particular, we propose a flexible framework of locally adaptive neural networks with different probabilistic forecast types as output, which not only significantly outperform all benchmark postprocessing methods but also learn physically consistent relations associated with the diurnal cycle, especially the evening transition of the planetary boundary layer.
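Of the benchmarked methods, EMOS is compact enough to sketch: the forecast distribution's parameters are affine in the ensemble mean and variance, fit by minimizing the closed-form Gaussian CRPS. The link functions and toy data below are illustrative, not the paper's gust model (a non-negative variable like gusts would typically call for a truncated distribution).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a N(mu, sigma^2) forecast for observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z)
                    - 1 / np.sqrt(np.pi))

def fit_emos(ens_mean, ens_var, y):
    """EMOS sketch: mu = a + b*ens_mean, sigma^2 = c + d*ens_var, via CRPS."""
    def loss(p):
        a, b, c, d = p
        sigma = np.sqrt(np.maximum(c + d * ens_var, 1e-6))
        return crps_gaussian(a + b * ens_mean, sigma, y).mean()
    return minimize(loss, x0=[0.0, 1.0, 1.0, 1.0], method="Nelder-Mead").x

# Toy data: a biased, underdispersive 20-member ensemble.
rng = np.random.default_rng(3)
truth = rng.normal(10, 3, size=1000)
ens = truth[:, None] + 1.5 + rng.normal(0, 1.0, size=(1000, 20))
print(fit_emos(ens.mean(axis=1), ens.var(axis=1), truth))
# The intercept corrects the +1.5 bias; c and d inflate the spread.
```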

【10】 Maximum Entropy Spectral Analysis: a case study

Authors: Alessandro Martini, Stefano Schmidt, Walter Del Pozzo
Affiliations: Università di Pisa; Institute for Gravitational and Subatomic Physics (GRASP), Utrecht University
Note: 16 pages, 13 figures, submitted to A&A
Link: https://arxiv.org/abs/2106.09499
Abstract: The Maximum Entropy Spectral Analysis (MESA) method, developed by Burg, provides a powerful tool to perform spectral estimation of a time-series. The method relies on Jaynes' maximum entropy principle and provides the means of inferring the spectrum of a stochastic process in terms of the coefficients of some autoregressive process AR($p$) of order $p$. A closed-form recursive solution provides an estimate of the autoregressive coefficients as well as of the order $p$ of the process. We provide a ready-to-use implementation of the algorithm in the form of a python package, memspectrum. We characterize our implementation by performing a power spectral density analysis on synthetic data (with known power spectral density) and we compare different criteria for stopping the recursion. Furthermore, we compare the performance of our code with the ubiquitous Welch algorithm, using synthetic data generated from the released spectrum by the LIGO-Virgo collaboration. We find that, when compared to Welch's method, Burg's method provides a power spectral density (PSD) estimation with a systematically lower variance and bias. This is particularly manifest in the case of a small number of data points, making Burg's method most suitable to work in this regime.
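Burg's recursion, the core of MESA, chooses each reflection coefficient to minimize the combined forward and backward prediction error; the AR coefficients and noise power then fix the spectral estimate. A compact sketch (not the memspectrum implementation):

```python
import numpy as np

def burg_ar(x, p):
    """Burg's method: AR(p) coefficients (a[0] = 1) and noise power e."""
    x = np.asarray(x, dtype=float)
    f = x.copy()                 # forward prediction errors
    b = x.copy()                 # backward prediction errors
    a = np.array([1.0])
    e = np.mean(x ** 2)          # zero-order error power
    for _ in range(p):
        # Reflection coefficient minimizing forward + backward error power.
        num = -2.0 * np.dot(f[1:], b[:-1])
        den = np.dot(f[1:], f[1:]) + np.dot(b[:-1], b[:-1])
        k = num / den
        a_pad = np.concatenate([a, [0.0]])
        a = a_pad + k * a_pad[::-1]                     # Levinson update
        f, b = f[1:] + k * b[:-1], b[:-1] + k * f[1:]   # error updates
        e *= (1.0 - k ** 2)
    return a, e

def mesa_psd(a, e, freqs, dt=1.0):
    """PSD of the fitted AR process: e*dt / |sum_k a[k] exp(-2pi i f dt k)|^2."""
    z = np.exp(-2j * np.pi * freqs * dt)
    return e * dt / np.abs(np.polyval(a[::-1], z)) ** 2

# Toy check: recover the frequency of a noisy sinusoid.
x = (np.sin(2 * np.pi * 0.1 * np.arange(500))
     + 0.5 * np.random.default_rng(9).standard_normal(500))
a, e = burg_ar(x, p=10)
freqs = np.linspace(0.0, 0.5, 256)
print(freqs[np.argmax(mesa_psd(a, e, freqs))])  # close to 0.1
```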

【11】 Optimum Allocation for Adaptive Multi-Wave Sampling in R: The R Package optimall

Authors: Jasper B. Yang, Bryan E. Shepherd, Thomas Lumley, Pamela A. Shaw
Note: 31 pages, 7 figures
Link: https://arxiv.org/abs/2106.09494
Abstract: The R package optimall offers a collection of functions that efficiently streamline the design process of sampling in surveys ranging from simple to complex. The package's main functions allow users to interactively define and adjust strata cut points based on values or quantiles of auxiliary covariates, adaptively calculate the optimum number of samples to allocate to each stratum using Neyman or Wright allocation, and select specific IDs to sample based on a stratified sampling design. Using real-life epidemiological study examples, we demonstrate how optimall facilitates an efficient workflow for the design and implementation of surveys in R. Although tailored towards multi-wave sampling under two- or three-phase designs, the R package optimall may be useful for any sampling survey.
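The Neyman allocation mentioned here has a one-line closed form, n_h proportional to N_h * S_h (stratum size times within-stratum standard deviation). A minimal Python sketch of the formula, not the optimall API:

```python
import numpy as np

def neyman_allocation(N_h, S_h, n_total):
    """Neyman allocation: n_h proportional to N_h * S_h."""
    weights = np.asarray(N_h, float) * np.asarray(S_h, float)
    # Naive rounding; production code should repair the total afterwards.
    return np.round(n_total * weights / weights.sum()).astype(int)

# Three strata cut on an auxiliary covariate: sizes and within-stratum SDs.
print(neyman_allocation(N_h=[5000, 3000, 2000], S_h=[1.0, 2.5, 4.0],
                        n_total=500))
# More samples go to small-but-variable strata, minimizing estimator variance.
```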

【12】 Importance measures derived from random forests: characterisation and extension

Authors: Antonio Sutera
Advisors: Prof. Pierre Geurts, Prof. Louis Wehenkel
Note: PhD thesis, Liège, Belgium, June 2019
Link: https://arxiv.org/abs/2106.09473
Abstract: Nowadays new technologies, and especially artificial intelligence, are more and more established in our society. Big data analysis and machine learning, two sub-fields of artificial intelligence, are at the core of many recent breakthroughs in many application fields (e.g., medicine, communication, finance, ...), including some that are strongly related to our day-to-day life (e.g., social networks, computers, smartphones, ...). In machine learning, significant improvements are usually achieved at the price of an increasing computational complexity and thanks to bigger datasets. Currently, cutting-edge models built by the most advanced machine learning algorithms typically became simultaneously very efficient and profitable but also extremely complex. Their complexity is to such an extent that these models are commonly seen as black-boxes providing a prediction or a decision which can not be interpreted or justified. Nevertheless, whether these models are used autonomously or as a simple decision-making support tool, they are already being used in machine learning applications where health and human life are at stake. Therefore, it appears to be an obvious necessity not to blindly believe everything coming out of those models without a detailed understanding of their predictions or decisions. Accordingly, this thesis aims at improving the interpretability of models built by a specific family of machine learning algorithms, the so-called tree-based methods. Several mechanisms have been proposed to interpret these models and we aim throughout this thesis to improve their understanding, study their properties, and define their limitations.

【13】 Taming Nonconvexity in Kernel Feature Selection -- Favorable Properties of the Laplace Kernel

Authors: Feng Ruan, Keli Liu, Michael I. Jordan
Affiliations: University of California, Berkeley; The Voleon Group
Note: 33 pages main text
Link: https://arxiv.org/abs/2106.09387
Abstract: Kernel-based feature selection is an important tool in nonparametric statistics. Despite many practical applications of kernel-based feature selection, there is little statistical theory available to support the method. A core challenge is that the objective functions of the optimization problems used to define kernel-based feature selection are nonconvex. The literature has only studied the statistical properties of the global optima, which is a mismatch, given that the gradient-based algorithms available for nonconvex optimization are only able to guarantee convergence to local minima. Studying the full landscape associated with kernel-based methods, we show that feature selection objectives using the Laplace kernel (and other $\ell_1$ kernels) come with statistical guarantees that other kernels, including the ubiquitous Gaussian kernel (or other $\ell_2$ kernels) do not possess. Based on a sharp characterization of the gradient of the objective function, we show that $\ell_1$ kernels eliminate unfavorable stationary points that appear when using an $\ell_2$ kernel. Armed with this insight, we establish statistical guarantees for $\ell_1$ kernel-based feature selection which do not require reaching the global minima. In particular, we establish model-selection consistency of $\ell_1$-kernel-based feature selection in recovering main effects and hierarchical interactions in the nonparametric setting with $n \sim \log p$ samples.

【14】 An efficient parallel block coordinate descent algorithm for large-scale precision matrix estimation using graphics processing units

Authors: Young-Geun Choi, Seunghwan Lee, Donghyeon Yu
Link: https://arxiv.org/abs/2106.09382
Abstract: Large-scale sparse precision matrix estimation has attracted wide interest from the statistics community. The convex partial correlation selection method (CONCORD) developed by Khare et al. (2015) has recently been credited with some theoretical properties for estimating sparse precision matrices. CONCORD obtains its solution by a coordinate descent algorithm (CONCORD-CD) based on the convexity of the objective function. However, since a coordinate-wise update in CONCORD-CD is inherently serial, a scale-up is nontrivial. In this paper, we propose a novel parallelization of CONCORD-CD, namely, CONCORD-PCD. CONCORD-PCD partitions the off-diagonal elements into several groups and updates each group simultaneously without harming the computational convergence of CONCORD-CD. We guarantee this by employing the notion of edge coloring in graph theory. Specifically, we establish a nontrivial correspondence between scheduling the updates of the off-diagonal elements in CONCORD-CD and coloring the edges of a complete graph. It turns out that CONCORD-PCD simultaneously updates off-diagonal elements whose associated edges are colorable with the same color. As a result, the number of steps required for updating off-diagonal elements reduces from p(p-1)/2 to p-1 (for even p) or p (for odd p), where p denotes the number of variables. We prove that the number of such steps is irreducible. In addition, CONCORD-PCD is tailored to single-instruction multiple-data (SIMD) parallelism. A numerical study shows that the SIMD-parallelized PCD algorithm implemented in graphics processing units (GPUs) boosts the CONCORD-CD algorithm multiple times.
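The edge-coloring correspondence can be made concrete with the classic round-robin (circle) construction, which partitions the p(p-1)/2 edges of the complete graph K_p into p-1 perfect matchings for even p. Each matching is one color class, i.e., one batch of off-diagonal entries that touch disjoint coordinates and can be updated simultaneously. A sketch of the scheduling idea, not the paper's GPU code:

```python
def round_robin_colors(p):
    """Partition the edges of K_p into p-1 perfect matchings (p even).

    Classic circle method: fix vertex 0, rotate the others; edges sharing
    a color touch disjoint vertices, so their updates can run in parallel.
    """
    assert p % 2 == 0, "for odd p, add a dummy vertex (p+1 rounds of p-1)"
    others = list(range(1, p))
    rounds = []
    for _ in range(p - 1):
        pairs = [(0, others[0])]
        for i in range(1, p // 2):
            pairs.append((others[i], others[-i]))
        rounds.append([tuple(sorted(e)) for e in pairs])
        others = others[1:] + others[:1]   # rotate everyone but vertex 0
    return rounds

for r, matching in enumerate(round_robin_colors(6)):
    print(f"color {r}: {matching}")
# 5 colors x 3 disjoint edges = all 15 off-diagonal pairs of a 6x6 matrix.
```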

【15】 Optimal Design of Stress Levels in Accelerated Degradation Testing for Multivariate Linear Degradation Models

Authors: Helmi Shat
Affiliations: Institute for Mathematical Stochastics, Otto-von-Guericke University Magdeburg, Germany
Note: arXiv admin note: substantial text overlap with arXiv:2102.09446
Link: https://arxiv.org/abs/2106.09379
Abstract: In recent years, more attention has been paid to accelerated degradation testing in order to characterize accurate estimation of reliability properties for systems that are designed to work properly for years or even decades. In this regard, degradation data from particular testing levels of the stress variable(s) are extrapolated with an appropriate statistical model to obtain estimates of lifetime quantiles at normal use levels. In this paper we propose optimal experimental designs for repeated-measures accelerated degradation tests with competing failure modes that correspond to multiple response components. The observation time points are assumed to be fixed and known in advance. The marginal degradation paths are expressed using linear mixed effects models. The optimal design is obtained by minimizing the asymptotic variance of the estimator of some quantile of the failure time distribution at the normal use conditions. Numerical examples are introduced to ensure the robustness of the proposed optimal designs and compare their efficiency with standard experimental designs.

【16】 Differentially Private Hamiltonian Monte Carlo

Authors: Ossi Räisä, Antti Koskela, Antti Honkela
Affiliations: Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki
Note: 18 pages, 3 figures
Link: https://arxiv.org/abs/2106.09376
Abstract: Markov chain Monte Carlo (MCMC) algorithms have long been the main workhorses of Bayesian inference. Among them, Hamiltonian Monte Carlo (HMC) has recently become very popular due to its efficiency resulting from effective use of the gradients of the target distribution. In privacy-preserving machine learning, differential privacy (DP) has become the gold standard in ensuring that the privacy of data subjects is not violated. Existing DP MCMC algorithms either use random-walk proposals, or do not use the Metropolis-Hastings (MH) acceptance test to ensure convergence without decreasing their step size to zero. We present a DP variant of HMC using the MH acceptance test that builds on a recently proposed DP MCMC algorithm called the penalty algorithm, and adds noise to the gradient evaluations of HMC. We prove that the resulting algorithm converges to the correct distribution, and is ergodic. We compare DP-HMC with the existing penalty, DP-SGLD and DP-SGNHT algorithms, and find that DP-HMC has better or equal performance than the penalty algorithm, and performs more consistently than DP-SGLD or DP-SGNHT.
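The penalty algorithm that DP-HMC builds on has a compact core: if the noise added to the log acceptance ratio is Gaussian with known variance, shifting the acceptance threshold by sigma^2/2 preserves the target distribution. A sketch on a random-walk chain with artificially noised log-ratios (the DP noise mechanism itself is not modeled here):

```python
import numpy as np

def penalty_accept(delta_hat, sigma2, rng):
    """Penalty-method MH test: with delta_hat ~ N(delta, sigma2), accepting
    with probability min(1, exp(delta_hat - sigma2/2)) leaves the target
    invariant. In DP-HMC, delta_hat would come from privatized evaluations."""
    return np.log(rng.random()) < delta_hat - sigma2 / 2.0

# Random-walk Metropolis on N(0,1) with noisy log acceptance ratios.
rng = np.random.default_rng(8)
sigma2 = 0.25
x, chain = 0.0, []
for _ in range(20_000):
    prop = x + rng.normal(0.0, 1.0)
    delta = -0.5 * (prop ** 2 - x ** 2)              # true log ratio
    delta_hat = delta + rng.normal(0.0, np.sqrt(sigma2))
    if penalty_accept(delta_hat, sigma2, rng):
        x = prop
    chain.append(x)
print(np.mean(chain), np.var(chain))  # approximately 0 and 1
```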

【17】 Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds, and Benign Overfitting

Authors: Frederic Koehler, Lijia Zhou, Danica J. Sutherland, Nathan Srebro
Affiliations: MIT; University of Chicago; UBC and Amii; TTI-Chicago; Collaboration on the Theoretical Foundations of Deep Learning (deepfoundations.ai)
Link: https://arxiv.org/abs/2106.09276
Abstract: We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class's Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of Bartlett et al. (2020) for minimum-norm interpolators, and confirms a prediction of Zhou et al. (2020) for near-minimal-norm interpolators in the special case of Gaussian data. We demonstrate the generality of the bound by applying it to the simplex, obtaining a novel consistency result for minimum $\ell_1$-norm interpolators (basis pursuit). Our results show how norm-based generalization bounds can explain and be used to analyze benign overfitting, at least in some settings.

【18】 Localized Uncertainty Attacks

Authors: Ousmane Amadou Dia, Theofanis Karaletsos, Caner Hazirbas, Cristian Canton Ferrer, Ilknur Kaynar Kabul, Erik Meijer
Affiliations: Facebook
Note: CVPR 2021 Workshop on Adversarial Machine Learning in Computer Vision
Link: https://arxiv.org/abs/2106.09222
Abstract: The susceptibility of deep learning models to adversarial perturbations has stirred renewed attention in adversarial examples resulting in a number of attacks. However, most of these attacks fail to encompass a large spectrum of adversarial perturbations that are imperceptible to humans. In this paper, we present localized uncertainty attacks, a novel class of threat models against deterministic and stochastic classifiers. Under this threat model, we create adversarial examples by perturbing only regions in the inputs where a classifier is uncertain. To find such regions, we utilize the predictive uncertainty of the classifier when the classifier is stochastic, or we learn a surrogate model to amortize the uncertainty when it is deterministic. Unlike $\ell_p$-ball or functional attacks which perturb inputs indiscriminately, our targeted changes can be less perceptible. When considered under our threat model, these attacks still produce strong adversarial examples, with the examples retaining a greater degree of similarity with the inputs.

【19】 Optimum-statistical collaboration towards efficient black-box optimization

Authors: Wenjie Li, Chihua Wang, Guang Cheng
Affiliations: Department of Statistics, Purdue University, West Lafayette, IN
Link: https://arxiv.org/abs/2106.09215
Abstract: With increasingly more hyperparameters involved in their training, machine learning systems demand a better understanding of hyperparameter tuning automation. This has raised interest in studies of provably black-box optimization, which is made more practical by better exploration mechanisms implemented in algorithm design, managing the flux of both optimization and statistical errors. Prior efforts focus on delineating optimization errors, but this is deficient: black-box optimization algorithms can be inefficient without considering heterogeneity among reward samples. In this paper, we make the key delineation on the role of statistical uncertainty in black-box optimization, guiding a more efficient algorithm design. We introduce optimum-statistical collaboration, a framework of managing the interaction between optimization error flux and statistical error flux evolving in the optimization process. Inspired by this framework, we propose the VHCT algorithms for objective functions with only local-smoothness assumptions. In theory, we prove our algorithm enjoys rate-optimal regret bounds; in experiments, we show the algorithm outperforms prior efforts in extensive settings.

【20】 Binary classification with corrupted labels

Authors: Yonghoon Lee, Rina Foygel Barber
Link: https://arxiv.org/abs/2106.09136
Abstract: In a binary classification problem where the goal is to fit an accurate predictor, the presence of corrupted labels in the training data set may create an additional challenge. However, in settings where likelihood maximization is poorly behaved (for example, if positive and negative labels are perfectly separable), a small fraction of corrupted labels can improve performance by ensuring robustness. In this work, we establish that in such settings, corruption acts as a form of regularization, and we compute precise upper bounds on estimation error in the presence of corruptions. Our results suggest that the presence of corrupted data points is beneficial only up to a small fraction of the total sample, scaling with the square root of the sample size.

【21】 Clustering inference in multiple groups

Authors: Debora Zava Bello, Marcio Valk, Gabriela Bettella Cybis
Affiliations: Department of Statistics, Federal University of Rio Grande do Sul
Link: https://arxiv.org/abs/2106.09115
Abstract: Inference in clustering is paramount to uncovering inherent group structure in data. Clustering methods which assess statistical significance have recently drawn attention owing to their importance for the identification of patterns in high dimensional data with applications in many scientific fields. We present here a U-statistics based approach, specially tailored for high-dimensional data, that clusters the data into three groups while assessing the significance of such partitions. Because our approach stands on the U-statistics based clustering framework of the methods in the R package uclust, it inherits their characteristics, being a non-parametric method relying on very few assumptions about the data, and thus can be applied to a wide range of datasets. Furthermore, our method aims to be a more powerful tool to find the best partitions of the data into three groups when that particular structure is present. In order to do so, we first propose an extension of the test U-statistic and develop its asymptotic theory. Additionally, we propose a ternary non-nested significance clustering method. Our approach is tested through multiple simulations and found to have more statistical power than competing alternatives in all scenarios considered. Applications to peripheral blood mononuclear cells and to image recognition show the versatility of our proposal, presenting superior performance when compared with other approaches.

【22】 Semiparametric count data regression for self-reported mental health

Authors: Daniel R. Kowal, Bohan Wu
Link: https://arxiv.org/abs/2106.09114
Abstract: "For how many days during the past 30 days was your mental health not good?" The responses to this question measure self-reported mental health and can be linked to important covariates in the National Health and Nutrition Examination Survey (NHANES). However, these count variables present major distributional challenges: the data are overdispersed, zero-inflated, bounded by 30, and heaped in five- and seven-day increments. To meet these challenges, we design a semiparametric estimation and inference framework for count data regression. The data-generating process is defined by simultaneously transforming and rounding (STAR) a latent Gaussian regression model. The transformation is estimated nonparametrically and the rounding operator ensures the correct support for the discrete and bounded data. Maximum likelihood estimators are computed using an EM algorithm that is compatible with any continuous data model estimable by least squares. STAR regression includes asymptotic hypothesis testing and confidence intervals, variable selection via information criteria, and customized diagnostics. Simulation studies validate the utility of this framework. STAR is deployed to study the factors associated with self-reported mental health and demonstrates substantial improvements in goodness-of-fit compared to existing count data regression models.
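The STAR data-generating process is easy to sketch: a latent Gaussian regression is pushed through an inverse transformation, floored, and clipped to {0, ..., 30}. The log-type transformation below is an illustrative stand-in, since the paper estimates the transformation nonparametrically.

```python
import numpy as np

rng = np.random.default_rng(4)

def star_sample(X, beta, sigma, g_inv=np.expm1, upper=30):
    """Draw STAR counts: latent Gaussian -> inverse transform -> round.

    g_inv is a stand-in transformation; flooring plus clipping yields the
    bounded count support and produces zero-inflation for negative latents.
    """
    z = X @ beta + sigma * rng.standard_normal(X.shape[0])  # latent Gaussian
    y = np.floor(g_inv(z))                                  # transform, round
    return np.clip(y, 0, upper).astype(int)                 # bounded support

X = np.column_stack([np.ones(8), rng.standard_normal(8)])
print(star_sample(X, beta=np.array([1.0, 0.8]), sigma=0.6))
```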

【23】 Scenario Generation of Wind Farm Power for Real-Time System Operation

Authors: Trevor Werho, Junshan Zhang, Vijay Vittal, Yonghong Chen, Anupam Thatte, Long Zhao
Note: 8 pages, 9 figures, submitted to IEEE Transactions on Power Systems
Link: https://arxiv.org/abs/2106.09105
Abstract: This work proposes a method of wind farm scenario generation to support real-time optimization tools and presents key findings therein. This work draws upon work from the literature and presents an efficient and scalable method for producing an adequate number of scenarios for a large fleet of wind farms while capturing both spatial and temporal dependencies. The method makes probabilistic forecasts using conditional heteroscedastic regression for each wind farm and time horizon. Past training data is transformed (using the probabilistic forecasting models) into standard normal samples. A Gaussian copula is estimated from the normalized samples and used in real-time to enforce proper spatial and temporal dependencies. The method is evaluated using historical data from MISO and performance within the MISO real-time look-ahead framework is discussed.
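A minimal sketch of the pipeline described in the abstract: transform history to normal scores via the marginal forecasts, estimate the copula correlation, then map fresh correlated normal draws back through each margin's forecast quantile function. The toy margins and dimensions are illustrative, not MISO data.

```python
import numpy as np
from scipy import stats

def fit_gaussian_copula(pit_values):
    """Copula correlation from PIT-transformed history.

    pit_values: (n_days, n_margins) probability integral transforms of past
    outcomes under their probabilistic forecasts (one margin per
    farm/horizon pair).
    """
    z = stats.norm.ppf(np.clip(pit_values, 1e-6, 1 - 1e-6))
    return np.corrcoef(z, rowvar=False)

def generate_scenarios(corr, quantile_funcs, n_scen, rng):
    """Sample normals with the historical dependence; map each margin back
    through today's forecast quantile function."""
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_scen)
    u = stats.norm.cdf(z)
    return np.column_stack([q(u[:, j]) for j, q in enumerate(quantile_funcs)])

# Toy usage with made-up marginal forecasts (normal margins for brevity).
rng = np.random.default_rng(5)
cov = [[1.0, 0.8, 0.5], [0.8, 1.0, 0.8], [0.5, 0.8, 1.0]]
hist = stats.norm.cdf(rng.multivariate_normal([0, 0, 0], cov, size=2000))
corr = fit_gaussian_copula(hist)
qfs = [lambda u, m=m: stats.norm.ppf(u, loc=m, scale=2.0)
       for m in (10.0, 11.0, 12.5)]
scen = generate_scenarios(corr, qfs, n_scen=100, rng=rng)
print(scen.shape, np.corrcoef(scen, rowvar=False).round(2))
```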

【24】 Maximum likelihood estimation for mechanistic network models

Authors: Jonathan Larson, Jukka-Pekka Onnela
Note: 29 pages, 8 figures
Link: https://arxiv.org/abs/2106.09100
Abstract: Mechanistic network models specify the mechanisms by which networks grow and change, allowing researchers to investigate complex systems using both simulation and analytical techniques. Unfortunately, it is difficult to write likelihoods for instances of graphs generated with mechanistic models because of a combinatorial explosion in outcomes of repeated applications of the mechanism. Thus it is nearly impossible to estimate the parameters using maximum likelihood estimation. In this paper, we propose treating node sequence in a growing network model as an additional parameter, or as a missing random variable, and maximizing over the resulting likelihood. We develop this framework in the context of a simple mechanistic network model, used to study gene duplication and divergence, and test a variety of algorithms for maximizing the likelihood in simulated graphs. We also run the best-performing algorithm on a human protein-protein interaction network and four non-human protein-protein interaction networks. Although we focus on a specific mechanistic network model here, the proposed framework is more generally applicable to reversible models.
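The duplication-divergence mechanism is a few lines of code, which also makes the paper's point tangible: the likelihood of the final graph depends on the unobserved node arrival order, which is exactly the latent quantity the authors propose to maximize over. A common variant of the mechanism (details differ across papers):

```python
import random

def duplication_divergence(n, p_keep, seed=0):
    """Grow a graph by duplication and divergence: copy a random node, then
    keep each inherited edge independently with probability p_keep."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}            # seed graph: a single edge
    for v in range(2, n):
        anchor = rng.randrange(v)     # existing node to duplicate
        kept = {u for u in adj[anchor] if rng.random() < p_keep}
        adj[v] = kept
        for u in kept:
            adj[u].add(v)
    return adj

g = duplication_divergence(100, p_keep=0.4)
print(sum(len(nb) for nb in g.values()) // 2, "edges")
# Two different arrival orders can yield the same graph with different
# probabilities, hence the combinatorial explosion in the likelihood.
```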

【25】 Pre-processing with Orthogonal Decompositions for High-dimensional Explanatory Variables

Authors: Xu Han, Ethan X. Fang, Cheng Yong Tang
Affiliations: Department of Statistical Science, Fox Business School, Temple University, Philadelphia, PA, USA; Department of Statistics, The Pennsylvania State University, University Park, PA, USA
Link: https://arxiv.org/abs/2106.09071
Abstract: Strong correlations between explanatory variables are problematic for high-dimensional regularized regression methods. Due to the violation of the Irrepresentable Condition, the popular LASSO method may suffer from false inclusions of inactive variables. In this paper, we propose pre-processing with orthogonal decompositions (PROD) for the explanatory variables in high-dimensional regressions. The PROD procedure is constructed based upon a generic orthogonal decomposition of the design matrix. We demonstrate by two concrete cases that the PROD approach can be effectively constructed for improving the performance of high-dimensional penalized regression. Our theoretical analysis reveals its properties and benefits for high-dimensional penalized linear regression with LASSO. Extensive numerical studies with simulations and data analysis show the promising performance of PROD.

【26】 Statistical Query Lower Bounds for List-Decodable Linear Regression

Authors: Ilias Diakonikolas, Daniel M. Kane, Ankit Pensia, Thanasis Pittas, Alistair Stewart
Affiliations: University of Wisconsin-Madison; University of California, San Diego
Link: https://arxiv.org/abs/2106.09689
Abstract: We study the problem of list-decodable linear regression, where an adversary can corrupt a majority of the examples. Specifically, we are given a set $T$ of labeled examples $(x, y) \in \mathbb{R}^d \times \mathbb{R}$ and a parameter $0 < \alpha < 1/2$ such that an $\alpha$-fraction of the points in $T$ are i.i.d. samples from a linear regression model with Gaussian covariates, and the remaining $(1-\alpha)$-fraction of the points are drawn from an arbitrary noise distribution. The goal is to output a small list of hypothesis vectors such that at least one of them is close to the target regression vector. Our main result is a Statistical Query (SQ) lower bound of $d^{\mathrm{poly}(1/\alpha)}$ for this problem. Our SQ lower bound qualitatively matches the performance of previously developed algorithms, providing evidence that current upper bounds for this task are nearly best possible.

【27】 PAC-Bayes, MAC-Bayes and Conditional Mutual Information: Fast rate bounds that handle general VC classes

Authors: Peter Grünwald, Thomas Steinke, Lydia Zakynthinou
Note: 24 pages, accepted for publication at COLT 2021
Link: https://arxiv.org/abs/2106.09683
Abstract: We give a novel, unified derivation of conditional PAC-Bayesian and mutual information (MI) generalization bounds. We derive conditional MI bounds as an instance, with special choice of prior, of conditional MAC-Bayesian (Mean Approximately Correct) bounds, themselves derived from conditional PAC-Bayesian bounds, where 'conditional' means that one can use priors conditioned on a joint training and ghost sample. This allows us to get nontrivial PAC-Bayes and MI-style bounds for general VC classes, something recently shown to be impossible with standard PAC-Bayesian/MI bounds. Second, it allows us to get faster rates of order $O\left((\mathrm{KL}/n)^{\gamma}\right)$ for $\gamma > 1/2$ if a Bernstein condition holds and for exp-concave losses (with $\gamma=1$), which is impossible with both standard PAC-Bayes generalization and MI bounds. Our work extends the recent work by Steinke and Zakynthinou [2020] who handle MI with VC but neither PAC-Bayes nor fast rates, the recent work of Hellström and Durisi [2020] who extend the latter to the PAC-Bayes setting via a unifying exponential inequality, and Mhammedi et al. [2019] who initiated fast rate PAC-Bayes generalization error bounds but handle neither MI nor general VC classes.

【28】 A Short Note of PAGE: Optimal Convergence Rates for Nonconvex Optimization

Authors: Zhize Li
Affiliations: KAUST
Note: 4 pages
Link: https://arxiv.org/abs/2106.09663
Abstract: In this note, we first recall the nonconvex problem setting and introduce the optimal PAGE algorithm (Li et al., ICML'21). Then we provide a simple and clean convergence analysis of PAGE for achieving optimal convergence rates. Moreover, PAGE and its analysis can be easily adopted and generalized to other works. We hope that this note provides the insights and is helpful for future works.
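The PAGE gradient estimator itself is only a few lines: with a small probability take a fresh (large-batch) gradient, otherwise update the running estimator with a cheap minibatch gradient difference. A sketch on a toy quadratic, with illustrative step size, switching probability, and batch sizes:

```python
import numpy as np

def page(grad_batch, grad_mini, x0, eta, p, T, rng):
    """PAGE sketch: g is refreshed with a large-batch gradient w.p. p, else
    updated as g + (minibatch grad at x_new - minibatch grad at x),
    using the same minibatch at both points."""
    x = x0.copy()
    g = grad_batch(x)
    for _ in range(T):
        x_new = x - eta * g
        if rng.random() < p:
            g = grad_batch(x_new)
        else:
            gx_new, gx = grad_mini(x_new, x)
            g = g + (gx_new - gx)
        x = x_new
    return x

# Toy objective f(x) = mean_i 0.5 * ||x - c_i||^2 with noisy minibatches.
rng = np.random.default_rng(6)
C = rng.standard_normal((1000, 10))
grad_batch = lambda x: x - C.mean(axis=0)
def grad_mini(a, b, m=32):
    idx = rng.integers(0, len(C), m)
    cm = C[idx].mean(axis=0)          # shared minibatch for both points
    return a - cm, b - cm
x_star = C.mean(axis=0)
x_hat = page(grad_batch, grad_mini, np.zeros(10), 0.5, 0.1, 200, rng)
print(np.linalg.norm(x_hat - x_star))  # small: converges to the minimizer
```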

【29】 Deep Learning Through the Lens of Example Difficulty

Authors: Robert J. N. Baldock, Hartmut Maennel, Behnam Neyshabur
Affiliations: Google Research, Brain Team; Google Research, Blueshift Team
Note: Main paper: 15 pages, 8 figures. Appendix: 31 pages, 40 figures
Link: https://arxiv.org/abs/2106.09647
Abstract: Existing work on understanding deep learning often employs measures that compress all data-dependent information into a few numbers. In this work, we adopt a perspective based on the role of individual examples. We introduce a measure of the computational difficulty of making a prediction for a given input: the (effective) prediction depth. Our extensive investigation reveals surprising yet simple relationships between the prediction depth of a given input and the model's uncertainty, confidence, accuracy and speed of learning for that data point. We further categorize difficult examples into three interpretable groups, demonstrate how these groups are processed differently inside deep models and showcase how this understanding allows us to improve prediction accuracy. Insights from our study lead to a coherent view of a number of separately reported phenomena in the literature: early layers generalize while later layers memorize; early layers converge faster and networks learn easy data and simple functions first.
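Prediction depth is operationalized with k-NN probes on intermediate representations: the depth of an example is the first layer after which every probe already agrees with the network's final prediction. The sketch below follows our reading of that construction; the choice of k, the support set, and the probe labels are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def prediction_depth(layer_reps, final_preds, support_reps, support_labels,
                     k=30):
    """Sketch of (effective) prediction depth via layerwise k-NN probes.

    layer_reps / support_reps: per-layer feature arrays for the evaluated
    and support examples; depth = earliest layer from which all deeper
    probes agree with the network's final prediction.
    """
    L, n = len(layer_reps), len(final_preds)
    probe_preds = np.stack([
        KNeighborsClassifier(n_neighbors=k)
        .fit(support_reps[l], support_labels)
        .predict(layer_reps[l])
        for l in range(L)
    ])                                    # (L, n) probe predictions
    agree = probe_preds == np.asarray(final_preds)
    depth = np.empty(n, dtype=int)
    for i in range(n):
        l = L
        while l > 0 and agree[l - 1, i]:
            l -= 1
        depth[i] = l
    return depth

# Toy usage with random 5-d "representations" at 4 layers.
rng = np.random.default_rng(10)
layers = [rng.standard_normal((50, 5)) for _ in range(4)]
support = [rng.standard_normal((200, 5)) for _ in range(4)]
print(prediction_depth(layers, rng.integers(0, 2, 50),
                       support, rng.integers(0, 2, 200)))
```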

【30】 Meta-Calibration: Meta-Learning of Model Calibration Using Differentiable Expected Calibration Error

Authors: Ondrej Bohdal, Yongxin Yang, Timothy Hospedales
Affiliations: School of Informatics, The University of Edinburgh
Link: https://arxiv.org/abs/2106.09613
Abstract: Calibration of neural networks is a topical problem that is becoming increasingly important for real-world use of neural networks. The problem is especially noticeable when using modern neural networks, for which there is a significant difference between the model confidence and the confidence it should have. Various strategies have been successfully proposed, yet there is more space for improvements. We propose a novel approach that introduces a differentiable metric for expected calibration error and successfully uses it as an objective for meta-learning, achieving competitive results with state-of-the-art approaches. Our approach presents a new direction of using meta-learning to directly optimize model calibration, which we believe will inspire further work in this promising and new direction.

【31】 Prevalence and Propagation of Fake News

Authors: Banafsheh Behzad, Bhavana Bheem, Daniela Elizondo, Deyana Marsh, Susan Martonosi
Affiliations: Department of Information Systems, College of Business Administration at California State University, Long Beach; Department of Mathematics, Harvey Mudd College
Note: 45 pages, 22 figures. Submitted for peer review on 7 May 2021
Link: https://arxiv.org/abs/2106.09586
Abstract: In recent years, scholars have raised concerns on the effects that unreliable news, or "fake news," has on our political sphere, and our democracy as a whole. For example, the propagation of fake news on social media is widely believed to have influenced the outcome of national elections, including the 2016 U.S. Presidential Election, and the 2020 COVID-19 pandemic. What drives the propagation of fake news on an individual level, and which interventions could effectively reduce the propagation rate? Our model disentangles bias from truthfulness of an article and examines the relationship between these two parameters and a reader's own beliefs. Using the model, we create policy recommendations for both social media platforms and individual social media users to reduce the spread of untruthful or highly biased news. We recommend that platforms sponsor unbiased truthful news, focus fact-checking efforts on mild to moderately biased news, recommend friend suggestions across the political spectrum, and provide users with reports about the political alignment of their feed. We recommend that individual social media users fact check news that strongly aligns with their political bias and read articles of opposing political bias.

【32】 Knowledge distillation from multi-modal to mono-modal segmentation networks

Authors: Minhao Hu, Matthis Maillard, Ya Zhang, Tommaso Ciceri, Giammarco La Barbera, Isabelle Bloch, Pietro Gori
Affiliations: CMIC, Shanghai Jiao Tong University, Shanghai, China; LTCI, Télécom Paris, Institut Polytechnique de Paris, France
Link: https://arxiv.org/abs/2106.09564
Abstract: The joint use of multiple imaging modalities for medical image segmentation has been widely studied in recent years. The fusion of information from different modalities has been demonstrated to improve the segmentation accuracy, with respect to mono-modal segmentations, in several applications. However, acquiring multiple modalities is usually not possible in a clinical setting due to a limited number of physicians and scanners, and to limit costs and scan time. Most of the time, only one modality is acquired. In this paper, we propose KD-Net, a framework to transfer knowledge from a trained multi-modal network (teacher) to a mono-modal one (student). The proposed method is an adaptation of the generalized distillation framework where the student network is trained on a subset (1 modality) of the teacher's inputs (n modalities). We illustrate the effectiveness of the proposed framework in brain tumor segmentation with the BraTS 2018 dataset. Using different architectures, we show that the student network effectively learns from the teacher and always outperforms the baseline mono-modal network in terms of segmentation accuracy.
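The generalized-distillation objective behind this kind of teacher-student training can be sketched as a weighted sum of a hard-label loss and a soft-target loss against the multi-modal teacher (applied per voxel in segmentation; a classification-shaped toy here). The temperature T and mixing weight lam are illustrative, not the paper's settings.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, lam=0.5):
    """Generalized-distillation sketch: the student (seeing one modality)
    matches both the hard labels and the teacher's softened probabilities."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student_T = np.log(softmax(student_logits, T))
    kd = -(p_teacher * log_p_student_T).sum(axis=-1).mean()   # soft targets
    log_p = np.log(softmax(student_logits))
    ce = -log_p[np.arange(len(labels)), labels].mean()        # hard labels
    return (1 - lam) * ce + lam * (T ** 2) * kd               # T^2 rescaling

rng = np.random.default_rng(7)
print(distillation_loss(rng.standard_normal((8, 3)),
                        rng.standard_normal((8, 3)),
                        rng.integers(0, 3, 8)))
```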

【33】 Author Clustering and Topic Estimation for Short Texts 标题:短文本的作者聚类与主题估计

作者:Graham Tierney,Christopher Bail,Alexander Volfovsky 链接:https://arxiv.org/abs/2106.09533 摘要:分析短文本(如社交媒体帖子)非常困难,因为它依赖于观察大量文档级的词共现对。除主题分布外,这类建模的一个常见下游任务是对这些文档的作者进行分组,以便后续分析。传统模型用相互独立的过程来估计文档分组和识别用户聚类。我们提出了一个新模型,它扩展了潜在狄利克雷分配(Latent Dirichlet Allocation),对同一文档内词语之间的强依赖关系建模,并引入用户级的主题分布。我们还同时对用户进行聚类,省去了事后的聚类估计,并通过把噪声较大的用户级主题分布向典型值收缩来改进主题估计。在短文本常见的问题上,我们的方法表现得与传统方法一样好,甚至更好;我们在美国参议员的推文数据集上展示了它的用处,恢复出了反映党派意识形态的有意义的主题和聚类。 摘要:Analysis of short text, such as social media posts, is extremely difficult because it relies on observing many document-level word co-occurrence pairs. Beyond topic distributions, a common downstream task of the modeling is grouping the authors of these documents for subsequent analyses. Traditional models estimate the document groupings and identify user clusters with an independent procedure. We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document, with user-level topic distributions. We also simultaneously cluster users, removing the need for post-hoc cluster estimation and improving topic estimation by shrinking noisy user-level topic distributions towards typical values. Our method performs as well as -- or better -- than traditional approaches to problems arising in short text, and we demonstrate its usefulness on a dataset of tweets from United States Senators, recovering both meaningful topics and clusters that reflect partisan ideology.
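下面给出一个极简的示意性片段,只演示摘要中“把噪声较大的用户级主题分布向典型值收缩”这一思想:在 Dirichlet 先验下取后验均值即得到收缩估计。其中主题数、全局分布和先验强度 alpha 均为假设的示例值,并非论文模型的实现。

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5                                                   # 主题数(示例)
global_topics = np.array([0.4, 0.3, 0.15, 0.1, 0.05])   # 假设的全局“典型”主题分布
alpha = 10.0                                            # 先验强度,越大收缩越强(示例)

# 某用户只有 8 条短文本,按主题计数(样本少、噪声大)
user_counts = rng.multinomial(8, global_topics)

# 朴素估计:直接归一化计数
naive = user_counts / user_counts.sum()

# 收缩估计:Dirichlet(alpha * global_topics) 先验下的后验均值
shrunk = (user_counts + alpha * global_topics) / (user_counts.sum() + alpha)

print("naive :", np.round(naive, 3))
print("shrunk:", np.round(shrunk, 3))   # 更接近 global_topics
```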

【34】 Risk assessment for long and short range airborne transmission of SARS-CoV-2, indoors and outdoors, using carbon dioxide measurements 标题:用二氧化碳测量进行室内和室外SARS-CoV-2远距离和短距离空气传播的风险评估

作者:Florian Poydenot,Ismael Abdourahamane,Elsa Caplain,Samuel Der,Jacques Haiech,Antoine Jallon,Ines Khoutami,Amir Loucif,Emil Marinov,Bruno Andreotti 机构:Laboratoire de Physique de l'Ecole Normale Supérieure (LPENS), CNRS UMR, Ecole Normale Supérieure, Université PSL, Sorbonne Université, Université de Paris, Paris, France; Cogitamus Laboratory and CNRS UMR 备注:28 pages, 16 figures 链接:https://arxiv.org/abs/2106.09489 摘要:在学校、办公室、大学演讲厅、医院、博物馆、剧院或购物中心等公共场所,对病毒传播风险进行定量分析,就有可能识别出积极卫生安全政策的有效杠杆,并评估由此获得的传播削减。SARS-CoV-2在此类公共场所对疫情传播的贡献可在短期内降低到与疫情消退相适应的水平,即总体流行病繁殖率低于1。在这里,我们重新审视室内外传播风险的定量评估。结果表明,气溶胶的远距离传播受新鲜空气流量和口罩过滤质量的控制,并与CO2浓度定量相关,而与房间容积和人数无关。短距离空气传播则通过在法国两家购物中心进行的专门弥散实验加以研究。呼出的气溶胶被湍流气流在一个锥体内弥散,导致浓度与距离的平方及气流速度成反比。我们表明,平均感染剂量(称为“病毒量子”)可以由流行病学和生物学实验数据一致地确定。实际意义:研究结果为合理设计卫生政策提供了依据,即通过加强通风、空气净化、风扇的机械弥散以及激励正确佩戴高质量口罩(外科口罩,外面可加戴织物口罩,或非医用FFP2口罩)来阻断病毒传播的主要途径。这些措施相结合并经定量评估,可显著降低SARS-CoV-2的空气传播风险。 摘要:The quantitative analysis of viral transmission risk in public places such as schools, offices, university lecture halls, hospitals, museums, theaters or shopping malls makes it possible to identify the effective levers for a proactive policy of health security and to evaluate the reduction in transmission thus obtained. The contribution to the epidemic propagation of SARS-CoV-2 in such public spaces can be reduced in the short term to a level compatible with an epidemic decline, i.e. with an overall epidemic reproduction rate below one. Here, we revisit the quantitative assessment of indoor and outdoor transmission risk. We show that the long range aerosol transmission is controlled by the flow rate of fresh air and by the mask filtering quality, and is quantitatively related to the CO2 concentration, regardless the room volume and the number of people. The short range airborne transmission is investigated experimentally using dedicated dispersion experiments performed in two French shopping malls. Exhaled aerosols are dispersed by turbulent draughts in a cone, leading to a concentration inversely proportional to the squared distance and to the flow velocity. We show that the average infection dose, called the viral quantum, can be consistently determined from epidemiological and biological experimental data. Practical implications. The results provide a rational design of sanitary policies to prevent the dominant routes of viral transmission by reinforced ventilation, air purification, mechanical dispersion by fans and incentives for correct wearing of quality masks (surgical mask, possibly covered by a fabric mask, or non-medical FFP2 masks). Combined, such measures significantly reduce the airborne transmission risk of SARS-CoV-2, with a quantitative assessment.
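下面是一个高度简化的 Wells-Riley / Rudnick-Milton 型估算草图,演示“CO2 浓度如何与远距离气溶胶传播风险定量相关”这一思路;所有参数(感染量子产生率 q、口罩穿透率等)均为示例假设,并非论文给出的精确模型或数值。

```python
import math

# 示例参数,均为假设值
C_room, C_out = 800.0, 420.0   # 室内 / 室外 CO2 浓度 (ppm)
C_exhaled = 38000.0            # 呼出气中 CO2 超出室外的量 (ppm),常用近似
n_people, n_infected = 30, 1
q = 10.0                       # 感染者每小时呼出的“感染量子”数(假设)
t = 2.0                        # 暴露时长 (小时)
mask_penetration = 0.2         # 双方佩戴口罩后的总穿透率(假设)

# 再呼吸分数:吸入气中来自他人呼出气的占比,可由 CO2 直接推出
f = (C_room - C_out) / C_exhaled
dose = f * (n_infected / n_people) * q * t * mask_penetration
risk = 1.0 - math.exp(-dose)   # Wells-Riley 型感染概率
print(f"rebreathed fraction = {f:.4f}, infection risk ≈ {risk:.2%}")
```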

【35】 Algorithmic Bias and Data Bias: Understanding the Relation between Distributionally Robust Optimization and Data Curation 标题:算法偏差与数据偏差:理解分布鲁棒优化与数据整理之间的关系

作者:Agnieszka Słowik,Léon Bottou 机构:Department of Computer Science and Technology, University of Cambridge, Cambridge, UK, Facebook AI Research, New York, NY, USA, and New York University, New York, NY, USA 链接:https://arxiv.org/abs/2106.09467 摘要:基于平均误差最小化的机器学习系统,会在一些重要的数据子集上表现不一致,而整个数据集的低平均误差并不能暴露这种不一致。在事关重大的社会和经济应用中,数据代表的是人,这可能导致对代表性不足的性别和族裔群体的歧视。鉴于偏差缓解在机器学习中的重要性,该主题引发了关于如何在实践中确保公平(数据偏差与算法偏差之争)的争论。分布鲁棒优化(DRO)通过最小化各子群体上的最坏期望风险,看似解决了这个问题。我们建立的理论结果阐明了DRO与在适当加权的训练数据集上最小化同一平均损失之间的关系。结果涵盖有限个和无限个训练分布,以及凸和非凸损失函数。我们表明,无论是DRO还是训练集的整理,都不应被视为缓解偏差的完整解决方案:正如不存在普适稳健的训练集一样,也不存在设置DRO问题并确保结果在社会意义上可接受的普适方法。然后,我们利用这些见解给出一套最小的实用建议,用于借助DRO应对偏差。最后,我们以对抗鲁棒性为例,讨论了这些结果在DRO其他相关应用中的影响。我们的结果表明,只要支持相应立场的论据被精确限定并得到现有相关数学的支持,偏差之争中以算法为中心和以数据为中心的双方都有其价值。 摘要:Machine learning systems based on minimizing average error have been shown to perform inconsistently across notable subsets of the data, which is not exposed by a low average error for the entire dataset. In consequential social and economic applications, where data represent people, this can lead to discrimination of underrepresented gender and ethnic groups. Given the importance of bias mitigation in machine learning, the topic leads to contentious debates on how to ensure fairness in practice (data bias versus algorithmic bias). Distributionally Robust Optimization (DRO) seemingly addresses this problem by minimizing the worst expected risk across subpopulations. We establish theoretical results that clarify the relation between DRO and the optimization of the same loss averaged on an adequately weighted training dataset. The results cover finite and infinite number of training distributions, as well as convex and non-convex loss functions. We show that neither DRO nor curating the training set should be construed as a complete solution for bias mitigation: in the same way that there is no universally robust training set, there is no universal way to setup a DRO problem and ensure a socially acceptable set of results. We then leverage these insights to provide a minimal set of practical recommendations for addressing bias with DRO. Finally, we discuss ramifications of our results in other related applications of DRO, using an example of adversarial robustness. Our results show that there is merit to both the algorithm-focused and the data-focused side of the bias debate, as long as arguments in favor of these positions are precisely qualified and backed by relevant mathematics known today.
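下面的小例子演示 DRO 与加权训练之间的联系:子群上的最坏期望风险等价于在权重单纯形上做对抗加权,这里用指数梯度更新来逼近最坏权重。损失数值与步长均为示例假设,并非论文理论结果的实现。

```python
import numpy as np

group_losses = np.array([0.20, 0.35, 0.90])  # 三个子群的平均损失(假设)
q = np.ones(3) / 3                           # 初始权重:普通平均
eta = 1.0                                    # 步长(示例)

for _ in range(50):
    q = q * np.exp(eta * group_losses)       # 向高损失子群加权
    q = q / q.sum()                          # 投影回单纯形

print("adversarial weights:", np.round(q, 3))
print("DRO objective ->", np.round(q @ group_losses, 3))  # 趋近 max_g loss_g
```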

【36】 Deep generative modeling for probabilistic forecasting in power systems 标题:电力系统概率预测的深度生成式建模

作者:Jonathan Dumas,Antoine Wehenkel,Damien Lanaspeze,Bertrand Cornélusse,Antonio Sutera 机构:Liege University, Departments of Computer Science and Electrical Engineering, Belgium, Mines ParisTech, France 链接:https://arxiv.org/abs/2106.09370 摘要:对可再生能源占比更高的终端用能部门进行更大程度的直接电气化,是到2050年实现碳中和社会的支柱之一。本研究使用一种较新的深度学习技术——归一化流(normalizing flows)——来生成准确的概率预测,这对决策者应对电力系统应用中的新挑战至关重要。通过使用2014年全球能源预测竞赛开放数据进行的全面实证评估,我们证明了该方法与其他最先进的深度学习生成模型(生成对抗网络和变分自编码器)相比具有竞争力。对生成基于天气的风电、光伏与负荷情景的各个模型,我们既从预测价值(通过一个能源零售商的案例研究)也从预测质量(使用若干互补指标)两方面进行了恰当的比较。 摘要:Greater direct electrification of end-use sectors with a higher share of renewables is one of the pillars to power a carbon-neutral society by 2050. This study uses a recent deep learning technique, the normalizing flows, to produce accurate probabilistic forecasts that are crucial for decision-makers to face the new challenges in power systems applications. Through comprehensive empirical evaluations using the open data of the Global Energy Forecasting Competition 2014, we demonstrate that our methodology is competitive with other state-of-the-art deep learning generative models: generative adversarial networks and variational autoencoders. The models producing weather-based wind, solar power, and load scenarios are properly compared both in terms of forecast value, by considering the case study of an energy retailer, and quality using several complementary metrics.
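作为归一化流思想的一个最小示例,下面用一维仿射流 z = (x - mu) / sigma 演示变量替换公式下的最大似然拟合与情景采样;数据与超参数均为假设,真实的电力情景生成模型要复杂得多。

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=5000)   # 假想的历史负荷数据

mu, log_sigma = 0.0, 0.0
for _ in range(4000):                           # 极简的最大似然梯度上升
    sigma = np.exp(log_sigma)
    z = (x - mu) / sigma
    mu += 1e-2 * np.mean(z / sigma)             # d log p / d mu
    log_sigma += 1e-2 * np.mean(z**2 - 1.0)     # d log p / d log_sigma

# 变量替换公式:log p(x) = log N(z; 0, 1) - log sigma
sigma = np.exp(log_sigma)
z = (x - mu) / sigma
loglik = np.mean(-0.5 * (z**2 + np.log(2 * np.pi)) - log_sigma)
print(f"mu≈{mu:.2f}, sigma≈{sigma:.2f}, mean loglik={loglik:.3f}")

scenarios = mu + sigma * rng.normal(size=10)    # 逆向采样生成 10 个情景
print(np.round(scenarios, 2))
```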

【37】 Towards sampling complex actions 标题:迈向复作用量的采样

作者:Lukas Kades,Martin Gärttner,Thomas Gasenzer,Jan M. Pawlowski 机构:Institut für Theoretische Physik, Ruprecht-Karls-Universität Heidelberg, Philosophenweg , Heidelberg, Germany, Physikalisches Institut, Universität Heidelberg, Im Neuenheimer Feld , Heidelberg, Germany 备注:29 pages, 9 figures 链接:https://arxiv.org/abs/2106.09367 摘要:从自旋或质量不平衡的原子气体和石墨烯,到有限密度下的量子色动力学,再到量子系统的非平衡演化,许多物理系统都会遇到具有复作用量的路径积分。针对复作用量所引发的符号问题,人们已经发展了许多计算方法。其中,复朗之万动力学具有普遍适用的优点。它的一个关键挑战是动力学可能收敛到非物理不动点。在这样的不动点处,统计抽样过程并非基于物理作用量,因此会导致错误的预测。此外,由于过程的隐式性质,其非物理性很难被察觉。在本工作中,我们在扩展状态空间上建立了一种基于马尔可夫链蒙特卡罗方法的一般方案。在该方案中,我们为广义复朗之万动力学导出了一个显式的实值采样过程。在满足一组约束的条件下,这个采样过程就是物理的采样过程。这些约束来源于蒙特卡罗方案所满足的细致平衡方程。这使我们能够从新的视角重新推导复朗之万动力学,并为复作用量的新采样方案的显式构造建立了一个框架。 摘要:Path integrals with complex actions are encountered for many physical systems ranging from spin- or mass-imbalanced atomic gases and graphene to quantum chromo-dynamics at finite density to the non-equilibrium evolution of quantum systems. Many computational approaches have been developed for tackling the sign problem emerging for complex actions. Among these, complex Langevin dynamics has the appeal of general applicability. One of its key challenges is the potential convergence of the dynamics to unphysical fixed points. The statistical sampling process at such a fixed point is not based on the physical action and hence leads to wrong predictions. Moreover, its unphysical nature is hard to detect due to the implicit nature of the process. In the present work we set up a general approach based on a Markov chain Monte Carlo scheme in an extended state space. In this approach we derive an explicit real sampling process for generalized complex Langevin dynamics. Subject to a set of constraints, this sampling process is the physical one. These constraints originate from the detailed-balance equations satisfied by the Monte Carlo scheme. This allows us to re-derive complex Langevin dynamics from a new perspective and establishes a framework for the explicit construction of new sampling schemes for complex actions.
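下面的小实验对一个玩具复作用量 S(z) = z^2/2 + i*b*z 运行标准的复朗之万动力学(复漂移、实噪声);该高斯情形可解析求得 <z> = -i*b,可用来检验采样是否收敛到物理结果。这只演示方法本身,而非论文提出的扩展状态空间方案。

```python
import numpy as np

rng = np.random.default_rng(2)
b, dt = 1.0, 5e-3
n_steps, burn = 200_000, 20_000

z = 0.0 + 0.0j
acc, count = 0.0 + 0.0j, 0
for t in range(n_steps):
    # 复朗之万更新:漂移 -S'(z) = -(z + i*b),噪声为实高斯
    z = z - (z + 1j * b) * dt + np.sqrt(2 * dt) * rng.normal()
    if t >= burn:
        acc += z
        count += 1

print("<z> ≈", acc / count, "(exact: -1j * b = -1j)")
```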

【38】 Identifiability of AMP chain graph models 标题:AMP链图模型的可辨识性

作者:Yuhao Wang,Arnab Bhattacharyya 机构:National University of Singapore 备注:16 pages, 4 figures 链接:https://arxiv.org/abs/2106.09350 摘要:本文研究了Andersson-Madigan-Perlman(AMP)链图模型的可辨识性,它是线性结构方程模型和高斯图模型的共同推广。AMP模型由链分量上的DAG描述,而链分量本身是无向图。对于已知的链分量分解,我们证明了:如果各链分量残差协方差矩阵的行列式按拓扑序单调非减,则链分量上的DAG是可辨识的。该条件推广了贝叶斯网的等方差可辨识性准则,并且可以从行列式推广到半正定矩阵上的任意超可加函数。当分量分解未知时,我们刻画了允许使用基于子模函数最小化的多项式时间算法恢复完整结构的条件。我们还进行了实验,将我们算法的性能与现有基线进行了比较。 摘要:We study identifiability of Andersson-Madigan-Perlman (AMP) chain graph models, which are a common generalization of linear structural equation models and Gaussian graphical models. AMP models are described by DAGs on chain components which themselves are undirected graphs. For a known chain component decomposition, we show that the DAG on the chain components is identifiable if the determinants of the residual covariance matrices of the chain components are monotone non-decreasing in topological order. This condition extends the equal variance identifiability criterion for Bayes nets, and it can be generalized from determinants to any super-additive function on positive semidefinite matrices. When the component decomposition is unknown, we describe conditions that allow recovery of the full structure using a polynomial time algorithm based on submodular function minimization. We also conduct experiments comparing our algorithm's performance against existing baselines.
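下面的片段按论文叙述的条件做一个机械性检查:给定按拓扑序排列的各链分量残差协方差矩阵,验证其行列式是否单调非减;矩阵为随意构造的示例。

```python
import numpy as np

def determinants_monotone(res_covs):
    """res_covs: 按拓扑序给出的残差协方差矩阵列表。"""
    dets = [np.linalg.det(S) for S in res_covs]
    ok = all(d1 <= d2 + 1e-12 for d1, d2 in zip(dets, dets[1:]))
    return dets, ok

res_covs = [np.diag([1.0, 1.0]),            # det = 1
            np.array([[2.0, 0.5],
                      [0.5, 1.0]]),          # det = 1.75
            np.diag([2.0, 2.0])]             # det = 4
dets, ok = determinants_monotone(res_covs)
print(dets, "identifiability condition holds:", ok)
```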

【39】 Minimax Estimation of Partially-Observed Vector AutoRegressions 标题:部分观测向量自回归的极小极大估计

作者:Guillaume Dalle,Yohann de Castro 机构:CERMICS, École des Ponts, Marne-la-Vallée, France, Institut Camille Jordan, École Centrale Lyon, Écully, France 链接:https://arxiv.org/abs/2106.09327 摘要:为了理解像交通网络这样的大型动态系统的行为,人们往往必须依赖由一组传感器(例如单个车辆)传回的测量数据。这些测量可能不完整且不精确,使得恢复潜在的目标信号变得困难。为了量化这一现象,我们研究了一个部分观测状态空间模型的性质。在我们的设定中,潜在状态$X$遵循高维向量自回归过程$X_t = \theta X_{t-1} + \varepsilon_t$;同时,观测值$Y$由来自状态的受噪声污染的随机样本给出:$Y_t = \Pi_t X_t + \eta_t$。我们研究了几种随机抽样机制,考察了空间和时间相关性对抽样矩阵$\Pi_t$分布的影响。我们首先证明了转移矩阵$\theta$的极小极大估计误差的下界;然后描述了一种基于Dantzig选择器的稀疏估计量,并给出其非渐近误差上界,证明它在我们的大多数抽样机制下达到最优收敛速度。模拟时间序列上的数值实验验证了我们的理论结果,而对开放铁路数据的一个应用突显了该模型对公共交通流量分析的意义。 摘要:To understand the behavior of large dynamical systems like transportation networks, one must often rely on measurements transmitted by a set of sensors, for instance individual vehicles. Such measurements are likely to be incomplete and imprecise, which makes it hard to recover the underlying signal of interest. Hoping to quantify this phenomenon, we study the properties of a partially-observed state-space model. In our setting, the latent state $X$ follows a high-dimensional Vector AutoRegressive process $X_t = \theta X_{t-1} + \varepsilon_t$. Meanwhile, the observations $Y$ are given by a noise-corrupted random sample from the state $Y_t = \Pi_t X_t + \eta_t$. Several random sampling mechanisms are studied, allowing us to investigate the effect of spatial and temporal correlations in the distribution of the sampling matrices $\Pi_t$. We first prove a lower bound on the minimax estimation error for the transition matrix $\theta$. We then describe a sparse estimator based on the Dantzig selector and upper bound its non-asymptotic error, showing that it achieves the optimal convergence rate for most of our sampling mechanisms. Numerical experiments on simulated time series validate our theoretical findings, while an application to open railway data highlights the relevance of this model for public transport traffic analysis.
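下面按摘要中的设定模拟一条部分观测的 VAR 轨迹:$X_t = \theta X_{t-1} + \varepsilon_t$,$Y_t = \Pi_t X_t + \eta_t$;这里 $\Pi_t$ 取最简单的逐坐标独立伯努利抽样,$\theta$ 的构造与噪声水平均为示例假设,并未覆盖论文研究的各种相关抽样机制。

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, p = 10, 200, 0.3          # 维数、时长、每个坐标被观测的概率(示例)

theta = 0.4 * np.eye(n) + 0.05 * rng.standard_normal((n, n))
theta /= max(1.0, 1.1 * np.max(np.abs(np.linalg.eigvals(theta))))  # 保证平稳

X = np.zeros((T, n))
Y = np.full((T, n), np.nan)      # 未观测处记为 NaN
for t in range(1, T):
    X[t] = theta @ X[t - 1] + rng.standard_normal(n)
    mask = rng.random(n) < p                      # Pi_t:随机坐标抽样
    Y[t, mask] = X[t, mask] + 0.1 * rng.standard_normal(mask.sum())

print("observed fraction:", np.mean(~np.isnan(Y)))
```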

【40】 Towards Understanding Deep Learning from Noisy Labels with Small-Loss Criterion 标题:基于小损失准则的噪声标签深度学习理解

作者:Xian-Jin Gui,Wei Wang,Zhang-Hao Tian 机构:National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing , China 备注:Accepted to International Joint Conference on Artificial Intelligence (IJCAI) 2021, includes non-archival supplementary material 链接:https://arxiv.org/abs/2106.09291 摘要:深度神经网络需要大量的标记数据才能获得良好的性能。在实际应用中,标签通常从众包等非专家处收集以节省成本,因而带有噪声。在过去几年里,人们发展了多种处理含噪标签的深度学习方法,其中许多基于小损失准则。然而,很少有理论分析来解释这些方法为何能从含噪标签中学得很好。本文从理论上解释了被广泛使用的小损失准则为何有效。基于这一解释,我们重新形式化了原始(vanilla)的小损失准则,以更好地应对标签噪声。实验结果验证了我们的理论解释,也证明了重新形式化的有效性。 摘要:Deep neural networks need large amounts of labeled data to achieve good performance. In real-world applications, labels are usually collected from non-experts such as crowdsourcing to save cost and thus are noisy. In the past few years, deep learning methods for dealing with noisy labels have been developed, many of which are based on the small-loss criterion. However, there are few theoretical analyses to explain why these methods could learn well from noisy labels. In this paper, we theoretically explain why the widely-used small-loss criterion works. Based on the explanation, we reformalize the vanilla small-loss criterion to better tackle noisy labels. The experimental results verify our theoretical explanation and also demonstrate the effectiveness of the reformalization.
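小损失准则的核心操作可以用几行代码说明:在每个 batch 中只保留损失最小的 (1 - 噪声率) 比例样本参与更新,因为小损失样本更可能带有干净标签。下面的损失数值为随意生成,仅演示选择逻辑。

```python
import numpy as np

rng = np.random.default_rng(4)
per_sample_loss = rng.exponential(1.0, size=64)   # 一个 batch 的逐样本损失(示例)
noise_rate = 0.2                                  # 假设的标签噪声率
n_keep = int((1 - noise_rate) * len(per_sample_loss))

keep_idx = np.argsort(per_sample_loss)[:n_keep]   # 小损失样本更可能是干净标签
selected_loss = per_sample_loss[keep_idx].mean()
print(f"keep {n_keep}/64 samples, mean loss {selected_loss:.3f}")
```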

【41】 Pruning Randomly Initialized Neural Networks with Iterative Randomization 标题:用迭代随机化方法修剪随机初始化的神经网络

作者:Daiki Chijiwa,Shin'ya Yamaguchi,Yasutoshi Ida,Kenji Umakoshi,Tomohiro Inoue 机构:Shin'ya Yamaguchi, NTT Software Innovation Center, NTT Corporation 备注:Code will be available at this https URL 链接:https://arxiv.org/abs/2106.09269 摘要:随机初始化神经网络权重的剪枝在彩票假设的背景下起着重要作用。Ramanujan等人(2020)的实证结果表明,仅靠剪枝权重(而不优化权重值)即可获得出色的性能。然而,为了达到与权重优化相同的性能水平,剪枝方法需要剪枝前的网络拥有更多参数,从而需要更多内存空间。为了克服这种参数低效问题,我们提出了一个通过迭代地随机化权重值来剪枝随机初始化神经网络的新框架(IteRand)。理论上,我们在该框架中证明了一个逼近定理,表明随机化操作可证明地有效减少所需的参数数目。我们还在CIFAR-10和ImageNet上的多组实验中实证展示了这种参数效率。 摘要:Pruning the weights of randomly initialized neural networks plays an important role in the context of lottery ticket hypothesis. Ramanujan et al. (2020) empirically showed that only pruning the weights can achieve remarkable performance instead of optimizing the weight values. However, to achieve the same level of performance as the weight optimization, the pruning approach requires more parameters in the networks before pruning and thus more memory space. To overcome this parameter inefficiency, we introduce a novel framework to prune randomly initialized neural networks with iteratively randomizing weight values (IteRand). Theoretically, we prove an approximation theorem in our framework, which indicates that the randomizing operations are provably effective to reduce the required number of the parameters. We also empirically demonstrate the parameter efficiency in multiple experiments on CIFAR-10 and ImageNet.
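下面是 IteRand 主循环的一个示意性草图:权重随机初始化后保持不动,只更新“分数”以决定剪枝掩码,并每隔若干步对被剪掉位置的权重重新随机化。这里用随机扰动代替真实的分数梯度,仅示意流程,并非论文实现。

```python
import numpy as np

rng = np.random.default_rng(5)
w = rng.standard_normal((32, 32))      # 随机初始化且固定的权重
scores = rng.standard_normal(w.shape)  # 可学习的剪枝分数
keep_ratio, period = 0.5, 10           # 保留比例与随机化周期(示例)

for step in range(100):
    k = int(keep_ratio * w.size)
    threshold = np.sort(scores, axis=None)[-k]
    mask = (scores >= threshold).astype(float)       # 分数最高的 k 个位置被保留
    scores += 0.01 * rng.standard_normal(w.shape)    # 占位更新,真实实现用反向传播
    if step % period == 0:                           # 迭代随机化:重抽被剪掉的权重
        w = np.where(mask == 1, w, rng.standard_normal(w.shape))

effective_w = w * mask
print("kept fraction:", mask.mean())
```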

【42】 Federated CycleGAN for Privacy-Preserving Image-to-Image Translation 标题:面向隐私保护图像到图像转换的联邦CycleGAN

作者:Joonyoung Song,Jong Chul Ye 机构:Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea 链接:https://arxiv.org/abs/2106.09246 摘要:无监督图像到图像转换方法(如CycleGAN)学习利用来自不同域的未配对训练数据集,将图像从一个域转换到另一个域。不幸的是,这些方法仍然需要集中收集未配对数据,可能带来隐私和安全问题。虽然最近的联邦学习(FL)允许在不交换数据的情况下训练神经网络,但FL的基本假设是所有客户端都拥有来自相似域的训练数据,这不同于我们的图像到图像转换场景:每个客户端只拥有其独有域的图像,目标是在不访问目标域数据的情况下学习不同域之间的图像转换。为此,我们在这里提出一种新的联邦CycleGAN架构,它能在保持数据隐私的同时以无监督方式学习图像转换。具体而言,我们的方法源于一个新的观察:CycleGAN损失可以分解为各客户端专属的局部目标之和,每一项仅用该客户端自己的数据即可计算。这种局部目标分解允许多个客户端在不牺牲性能的情况下参与联邦CycleGAN训练。此外,我们的方法采用了基于自适应实例归一化(AdaIN)的新型可切换生成器和判别器架构,显著降低了联邦学习的带宽需求。我们在各种无监督图像转换任务上的实验结果表明,我们的联邦CycleGAN取得了与非联邦版本相当的性能。 摘要:Unsupervised image-to-image translation methods such as CycleGAN learn to convert images from one domain to another using unpaired training data sets from different domains. Unfortunately, these approaches still require centrally collected unpaired records, potentially violating privacy and security issues. Although the recent federated learning (FL) allows a neural network to be trained without data exchange, the basic assumption of the FL is that all clients have their own training data from a similar domain, which is different from our image-to-image translation scenario in which each client has images from its unique domain and the goal is to learn image translation between different domains without accessing the target domain data. To address this, here we propose a novel federated CycleGAN architecture that can learn image translation in an unsupervised manner while maintaining the data privacy. Specifically, our approach arises from a novel observation that CycleGAN loss can be decomposed into the sum of client specific local objectives that can be evaluated using only their data. This local objective decomposition allows multiple clients to participate in federated CycleGAN training without sacrificing performance. Furthermore, our method employs novel switchable generator and discriminator architecture using Adaptive Instance Normalization (AdaIN) that significantly reduces the band-width requirement of the federated learning. Our experimental results on various unsupervised image translation tasks show that our federated CycleGAN provides comparable performance compared to the non-federated counterpart.
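摘要中的关键观察——CycleGAN 损失可分解为只依赖单个客户端数据的局部目标之和——可以用一个玩具例子说明:拥有域 A 数据的客户端只计算“假样本对抗项 + A 侧循环一致项”,拥有域 B 数据的客户端只计算判别器的真样本项(对称的 B 侧各项同理,这里略去)。网络用线性映射代替,仅示意分解本身,并非论文的可切换结构实现。

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4
G = rng.standard_normal((d, d)) * 0.1   # G: A -> B(玩具线性“生成器”)
F = rng.standard_normal((d, d)) * 0.1   # F: B -> A
w_DB = rng.standard_normal(d) * 0.1     # 判别器 D_B 的线性打分

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def local_objective_A(xA, lam=10.0):
    """只用域 A 数据即可计算的项:对抗项(假样本)+ A 侧循环一致项。"""
    fake_B = xA @ G.T
    adv = np.mean(np.log(1 - sigmoid(fake_B @ w_DB) + 1e-8))
    cyc = lam * np.mean((fake_B @ F.T - xA) ** 2)
    return adv + cyc

def local_objective_B(yB):
    """只用域 B 数据即可计算的项:判别器 D_B 的真样本项(B 侧其余项略)。"""
    return np.mean(np.log(sigmoid(yB @ w_DB) + 1e-8))

xA, yB = rng.standard_normal((32, d)), rng.standard_normal((32, d))
total = local_objective_A(xA) + local_objective_B(yB)  # 完整目标 = 局部目标之和
print("decomposed CycleGAN-style objective:", round(total, 4))
```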

【43】 Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning 标题:预训练语言模型为何有助于下游任务?头部微调与提示微调的分析

作者:Colin Wei,Sang Michael Xie,Tengyu Ma 机构:Stanford University, Department of Computer Science 链接:https://arxiv.org/abs/2106.09226 摘要:经过预训练的语言模型在适配下游NLP任务时取得了最先进的性能。然而,由于预训练任务和下游任务可能差异很大,对这些模型的理论分析既稀缺又具有挑战性。我们提出了一个分析框架,用一个潜变量文本生成模型把预训练和下游任务联系起来——下游分类器必须恢复潜变量后验分布的某个函数。我们在此设定下分析了头部微调(在冻结的预训练模型之上学习分类器)和提示微调。我们分析中的生成模型要么是隐马尔可夫模型(HMM),要么是受自然语言长程依赖启发、带有潜在记忆组件的增强HMM。我们证明:1)在HMM的某些非简并条件下,简单的分类头即可解决下游任务;2)提示微调可以在更弱的非简并条件下获得下游保证;3)由于任务相关信息更容易从长时记忆中恢复,记忆增强HMM的恢复保证比普通HMM更强。对由HMM合成生成的数据的实验支持了我们的理论发现。 摘要:Pretrained language models have achieved state-of-the-art performance when adapted to a downstream NLP task. However, theoretical analysis of these models is scarce and challenging since the pretraining and downstream tasks can be very different. We propose an analysis framework that links the pretraining and downstream tasks with an underlying latent variable generative model of text -- the downstream classifier must recover a function of the posterior distribution over the latent variables. We analyze head tuning (learning a classifier on top of the frozen pretrained model) and prompt tuning in this setting. The generative model in our analysis is either a Hidden Markov Model (HMM) or an HMM augmented with a latent memory component, motivated by long-term dependencies in natural language. We show that 1) under certain non-degeneracy conditions on the HMM, simple classification heads can solve the downstream task, 2) prompt tuning obtains downstream guarantees with weaker non-degeneracy conditions, and 3) our recovery guarantees for the memory-augmented HMM are stronger than for the vanilla HMM because task-relevant information is easier to recover from the long-term memory. Experiments on synthetically generated data from HMMs back our theoretical findings.

【44】 On the Power of Preconditioning in Sparse Linear Regression 标题:关于稀疏线性回归中预处理的作用

作者:Jonathan Kelner,Frederic Koehler,Raghu Meka,Dhruv Rohatgi 机构:MIT, UCLA 备注:73 pages, 5 figures 链接:https://arxiv.org/abs/2106.09207 摘要:稀疏线性回归是高维统计中的一个基本问题,但对于如何在不对设计矩阵施加限制性条件的情况下高效求解它,人们却知之甚少。我们考虑(相关)随机设计设定,其中协变量独立地从多元高斯$N(0,\Sigma)$($\Sigma$为$n \times n$矩阵)中抽取,并寻求最小化$(\hat{w}-w^*)^T \Sigma (\hat{w}-w^*)$的估计量$\hat{w}$,其中$w^*$是$k$-稀疏的真值。从信息论的角度,对任意的$\Sigma$和$w^*$,用$O(k \log n)$个样本即可获得很强的误差界;然而,在不对$\Sigma$或$w^*$作进一步假设的情况下,即便使用$o(n)$个样本,也没有已知的高效算法能匹配这些保证。在计算难度方面,已知的计算下界仅针对最坏情况的设计矩阵。已知存在对套索(Lasso)困难的随机设计实例,但这些实例通常在简单换基(即预条件化)之后即可由套索求解。在这项工作中,我们给出了澄清预条件化在稀疏线性回归中威力的上界和下界。首先,我们证明预条件化套索能近似最优地求解一大类稀疏线性回归问题:只要协变量的依赖结构(在马尔可夫性质的意义下)具有较低的树宽,它就会成功——即使$\Sigma$高度病态。其次,我们(首次)构造了对最优预条件化套索可证明困难的随机设计实例。事实上,我们通过证明如下结论完成了树宽分类:对任意树宽为$t$的图,存在该图上的一个高斯马尔可夫随机场,使得当协变量从该模型中抽取时,无论选择何种预条件子,预条件化套索都需要$\Omega(t^{1/20})$个样本才能恢复$O(\log n)$-稀疏信号。 摘要:Sparse linear regression is a fundamental problem in high-dimensional statistics, but strikingly little is known about how to efficiently solve it without restrictive conditions on the design matrix. We consider the (correlated) random design setting, where the covariates are independently drawn from a multivariate Gaussian $N(0,\Sigma)$ with $\Sigma : n \times n$, and seek estimators $\hat{w}$ minimizing $(\hat{w}-w^*)^T \Sigma (\hat{w}-w^*)$, where $w^*$ is the $k$-sparse ground truth. Information theoretically, one can achieve strong error bounds with $O(k \log n)$ samples for arbitrary $\Sigma$ and $w^*$; however, no efficient algorithms are known to match these guarantees even with $o(n)$ samples, without further assumptions on $\Sigma$ or $w^*$. As far as hardness, computational lower bounds are only known with worst-case design matrices. Random-design instances are known which are hard for the Lasso, but these instances can generally be solved by Lasso after a simple change-of-basis (i.e. preconditioning). In this work, we give upper and lower bounds clarifying the power of preconditioning in sparse linear regression. First, we show that the preconditioned Lasso can solve a large class of sparse linear regression problems nearly optimally: it succeeds whenever the dependency structure of the covariates, in the sense of the Markov property, has low treewidth -- even if $\Sigma$ is highly ill-conditioned. Second, we construct (for the first time) random-design instances which are provably hard for an optimally preconditioned Lasso. In fact, we complete our treewidth classification by proving that for any treewidth-$t$ graph, there exists a Gaussian Markov Random Field on this graph such that the preconditioned Lasso, with any choice of preconditioner, requires $\Omega(t^{1/20})$ samples to recover $O(\log n)$-sparse signals when covariates are drawn from this model.
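下面按摘要的设定做一个最小实验:生成相关随机设计 $X \sim N(0, \Sigma)$,用套索估计 $k$-稀疏真值,并计算论文使用的预测误差度量 $(\hat{w}-w^*)^T \Sigma (\hat{w}-w^*)$。协方差与正则化强度均为示例;论文中最优预条件子的构造不在此演示范围内。

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
n, m, k = 50, 400, 3                      # 维数、样本量、稀疏度(示例)

B = rng.standard_normal((n, n))
Sigma = B @ B.T / n + np.eye(n)           # 一个相关(此处并不病态)的协方差,仅作示例
Lchol = np.linalg.cholesky(Sigma)

w_star = np.zeros(n)
w_star[rng.choice(n, k, replace=False)] = 1.0          # k-稀疏真值
X = rng.standard_normal((m, n)) @ Lchol.T              # X ~ N(0, Sigma)
y = X @ w_star + 0.1 * rng.standard_normal(m)

w_hat = Lasso(alpha=0.02).fit(X, y).coef_
err = (w_hat - w_star) @ Sigma @ (w_hat - w_star)      # 论文中的预测误差度量
print(f"prediction error = {err:.4f}")
print("estimated support:", np.flatnonzero(np.abs(w_hat) > 0.1))
```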

【45】 Amortized Auto-Tuning: Cost-Efficient Transfer Optimization for Hyperparameter Recommendation 标题:摊余自动调谐:面向超参数推荐的高性价比迁移优化

作者:Yuxin Xiao,Eric P. Xing,Willie Neiswanger 机构:Carnegie Mellon University,Stanford University,Petuum,MBZUAI 链接:https://arxiv.org/abs/2106.09179 摘要:随着现代机器学习模型的超参数数量和训练时间激增,超参数调优变得越来越昂贵。尽管已有方法提出通过知识迁移来加速调优,但它们通常需要各超参数配置的最终性能,并且不关注低保真度信息。然而,这种常见做法是次优的,可能造成不必要的资源消耗。更具成本效益的做法是利用低保真度的调优观测来度量任务间相似性,并据此将知识从已有任务迁移到新任务。不过,在迁移设定中进行多保真度调优也有其自身的挑战:额外观测中的噪声,以及对性能预测的需求。因此,我们对多任务多保真度贝叶斯优化框架进行了深入分析,得到其最佳实例化——摊余自动调谐(AT2)。我们还发布了一个离线计算的27任务超参数推荐(HyperRec)数据库以服务社区。在HyperRec和其他真实数据库上的大量实验说明了我们AT2方法的有效性。 摘要:With the surge in the number of hyperparameters and training times of modern machine learning models, hyperparameter tuning is becoming increasingly expensive. Although methods have been proposed to speed up tuning via knowledge transfer, they typically require the final performance of hyperparameters and do not focus on low-fidelity information. Nevertheless, this common practice is suboptimal and can incur an unnecessary use of resources. It is more cost-efficient to instead leverage the low-fidelity tuning observations to measure inter-task similarity and transfer knowledge from existing to new tasks accordingly. However, performing multi-fidelity tuning comes with its own challenges in the transfer setting: the noise in the additional observations and the need for performance forecasting. Therefore, we conduct a thorough analysis of the multi-task multi-fidelity Bayesian optimization framework, which leads to the best instantiation--amortized auto-tuning (AT2). We further present an offline-computed 27-task hyperparameter recommendation (HyperRec) database to serve the community. Extensive experiments on HyperRec and other real-world databases illustrate the effectiveness of our AT2 method.

【46】 An Imprecise SHAP as a Tool for Explaining the Class Probability Distributions under Limited Training Data 标题:不精确SHAP:有限训练数据下解释类概率分布的工具

作者:Lev V. Utkin,Andrei V. Konstantinov,Kirill A. Vishniakov 机构:Peter the Great St.Petersburg Polytechnic University, St.Petersburg, Russia 链接:https://arxiv.org/abs/2106.09111 摘要:SHapley加性解释方法(SHAP)是最流行的机器学习预测解释方法之一。针对类概率分布不精确且由分布集合表示的情形,我们提出了一种不精确SHAP,作为原始SHAP的修正。不精确SHAP背后的第一个想法是一种计算特征边际贡献的新方法,它满足Shapley值的重要效率性质。第二个想法是尝试给出计算并缩减区间值Shapley值的一般方法,这类似于不精确概率论中可达概率区间的思想。我们提出了该一般方法的一个以线性优化问题形式给出的简单特例实现,它基于Kolmogorov-Smirnov距离和不精确污染模型。合成数据与真实数据的数值例子演示了这种不精确SHAP。 摘要:One of the most popular methods of the machine learning prediction explanation is the SHapley Additive exPlanations method (SHAP). An imprecise SHAP as a modification of the original SHAP is proposed for cases when the class probability distributions are imprecise and represented by sets of distributions. The first idea behind the imprecise SHAP is a new approach for computing the marginal contribution of a feature, which fulfils the important efficiency property of Shapley values. The second idea is an attempt to consider a general approach to calculating and reducing interval-valued Shapley values, which is similar to the idea of reachable probability intervals in the imprecise probability theory. A simple special implementation of the general approach in the form of linear optimization problems is proposed, which is based on using the Kolmogorov-Smirnov distance and imprecise contamination models. Numerical examples with synthetic and real data illustrate the imprecise SHAP.
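作为背景,下面按定义对一个三特征的小例子精确计算 Shapley 值,说明“边际贡献”与效率性(各特征贡献之和等于全集与空集价值之差);价值函数 v 为随意给定的示例,并非论文中针对不精确分布的区间值算法。

```python
import itertools
import math

features = [0, 1, 2]
v = {(): 0.0, (0,): 0.1, (1,): 0.2, (2,): 0.0,
     (0, 1): 0.5, (0, 2): 0.2, (1, 2): 0.3, (0, 1, 2): 0.7}  # 示例价值函数

def shapley(i):
    """按定义对所有不含 i 的子集 S 加权求和边际贡献 v(S ∪ {i}) - v(S)。"""
    n = len(features)
    others = [f for f in features if f != i]
    total = 0.0
    for r in range(n):
        for S in itertools.combinations(others, r):
            weight = (math.factorial(len(S)) * math.factorial(n - len(S) - 1)
                      / math.factorial(n))
            total += weight * (v[tuple(sorted(S + (i,)))] - v[S])
    return total

phis = [shapley(i) for i in features]
print(phis, "sum =", round(sum(phis), 6))  # 效率性:和 = v(全集) - v(空集) = 0.7
```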

【47】 Identifiability-Guaranteed Simplex-Structured Post-Nonlinear Mixture Learning via Autoencoder 标题:基于自动编码器的保证可辨识性的单纯形结构后非线性混合学习

作者:Qi Lyu,Xiao Fu 机构:School of Electrical Engineering and Computer Science, Oregon State University 链接:https://arxiv.org/abs/2106.09070 摘要:这项工作关注以无监督方式解开非线性混合潜在成分的问题。假设潜在成分位于概率单纯形中,并经过一个未知的后非线性混合系统变换。该问题在信号和数据分析中有广泛应用,如非线性高光谱解混、图像嵌入和非线性聚类等。线性混合学习问题本身已是病态的,因为目标潜在成分的可辨识性通常难以建立;再加上未知的非线性,问题就更具挑战性。先前的工作给出了一个基于函数方程的表述,用于可证明的潜在成分识别。然而,其可辨识性条件较为苛刻、不切实际。此外,其可辨识性分析基于无限样本(即总体)情形,对实际有限样本情形的理解一直难以捉摸。而且,先前工作中的算法以牺牲模型表达能力换取计算上的便利,这常常损害学习性能。我们的贡献有三方面。首先,在大幅放松的假设下导出了新的可辨识性条件。其次,首次给出了全面的样本复杂度结果。第三,提出了一个基于约束自编码器的算法框架用于实现,有效规避了现有算法的困难。合成与真实实验证实了我们的理论分析。 摘要:This work focuses on the problem of unraveling nonlinearly mixed latent components in an unsupervised manner. The latent components are assumed to reside in the probability simplex, and are transformed by an unknown post-nonlinear mixing system. This problem finds various applications in signal and data analytics, e.g., nonlinear hyperspectral unmixing, image embedding, and nonlinear clustering. Linear mixture learning problems are already ill-posed, as identifiability of the target latent components is hard to establish in general. With unknown nonlinearity involved, the problem is even more challenging. Prior work offered a function equation-based formulation for provable latent component identification. However, the identifiability conditions are somewhat stringent and unrealistic. In addition, the identifiability analysis is based on the infinite sample (i.e., population) case, while the understanding for practical finite sample cases has been elusive. Moreover, the algorithm in the prior work trades model expressiveness with computational convenience, which often hinders the learning performance. Our contribution is threefold. First, new identifiability conditions are derived under largely relaxed assumptions. Second, comprehensive sample complexity results are presented -- which are the first of the kind. Third, a constrained autoencoder-based algorithmic framework is proposed for implementation, which effectively circumvents the challenges in the existing algorithm. Synthetic and real experiments corroborate our theoretical analyses.
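下面给出一个带单纯形约束瓶颈的自编码器的最小 PyTorch 草图:编码器输出经 softmax 被约束在概率单纯形内,解码器扮演未知的后非线性混合系统。结构、数据生成方式与超参数均为假设,并非论文中带可辨识性保证的完整框架。

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, k = 8, 3                                  # 观测维数、潜在成分数(示例)

encoder = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, k))
decoder = nn.Sequential(nn.Linear(k, 32), nn.Tanh(), nn.Linear(32, d))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                       lr=1e-3)

# 合成数据:单纯形上的潜在成分经过一个固定的“后非线性混合”(示例)
s_true = torch.distributions.Dirichlet(torch.ones(k)).sample((2048,))
A = torch.randn(k, d)
x = torch.tanh(s_true @ A)

for step in range(500):
    s_hat = torch.softmax(encoder(x), dim=-1)   # 单纯形约束的瓶颈
    loss = ((decoder(s_hat) - x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("reconstruction MSE:", float(loss))
```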

【48】 Exponential Error Convergence in Data Classification with Optimized Random Features: Acceleration by Quantum Machine Learning 标题:优化随机特征数据分类的指数误差收敛:量子机器学习加速

作者:Hayata Yamasaki,Sho Sonoda 机构:Austrian Academy of Sciences, Vienna, Austria, Vienna University of Technology, Vienna, Austria, RIKEN AIP, Tokyo, Japan 备注:28 pages, no figure 链接:https://arxiv.org/abs/2106.09028 摘要:随机特征是基于核方法的可扩展学习算法的核心技术。最近的一项研究表明,一种在量子计算机上进行机器学习的算法——量子机器学习(QML)——可以指数级地加快优化随机特征的采样,而无需对矩阵的稀疏性和低秩性施加限制性假设(这类假设曾限制传统QML算法的适用性);该QML算法可以显著减少并可证明地最小化回归任务所需的特征数。然而,QML领域的一个主要关切是:量子计算的优势能在多大范围内被利用,而不仅限于回归任务。在这里,我们构造了一个由优化随机特征加速的分类任务QML算法。我们证明,在低噪声条件下,采样优化随机特征的QML算法与随机梯度下降(SGD)相结合,可以在分类任务中达到当前最优的、分类误差指数级下降的收敛速度;同时,我们带有优化随机特征的算法可以利用所需特征数的显著减少,加速SGD的每次迭代以及对所得分类器的评估。这些结果揭示了QML的一个有前景的应用:显著加速基于核方法的主流分类算法,而不破坏其对实际数据集的适用性和指数级的误差收敛速度。 摘要:Random features are a central technique for scalable learning algorithms based on kernel methods. A recent work has shown that an algorithm for machine learning by quantum computer, quantum machine learning (QML), can exponentially speed up sampling of optimized random features, even without imposing restrictive assumptions on sparsity and low-rankness of matrices that had limited applicability of conventional QML algorithms; this QML algorithm makes it possible to significantly reduce and provably minimize the required number of features for regression tasks. However, a major interest in the field of QML is how widely the advantages of quantum computation can be exploited, not only in the regression tasks. We here construct a QML algorithm for a classification task accelerated by the optimized random features. We prove that the QML algorithm for sampling optimized random features, combined with stochastic gradient descent (SGD), can achieve state-of-the-art exponential convergence speed of reducing classification error in a classification task under a low-noise condition; at the same time, our algorithm with optimized random features can take advantage of the significant reduction of the required number of features so as to accelerate each iteration in the SGD and evaluation of the classifier obtained from our algorithm. These results discover a promising application of QML to significant acceleration of the leading classification algorithm based on kernel methods, without ruining its applicability to a practical class of data sets and the exponential error-convergence speed.
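经典部分的“随机特征 + SGD”流水线可以用 scikit-learn 几行搭出,如下所示;论文讨论的是用量子算法从优化后的分布中采样随机特征,这里以普通的随机傅里叶特征(RBFSampler)代替,仅演示流水线骨架,参数均为示例。

```python
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier

X, y = make_moons(n_samples=2000, noise=0.1, random_state=0)

rff = RBFSampler(gamma=2.0, n_components=300, random_state=0)
Z = rff.fit_transform(X)                 # 随机特征映射,近似 RBF 核

clf = SGDClassifier(alpha=1e-4, random_state=0).fit(Z, y)  # 随机梯度下降训练线性分类器
print("train accuracy:", clf.score(Z, y))
```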
